Scraping Condo Listings from Homes.com Using Playwright: A Complete Engineering Guide
- Anusha P O
- 19 min read

Is buying a condo in California really more expensive than owning one in New York, or does the story change once the numbers are laid out side by side?
To explore this question, real condo listings were carefully scraped from Homes.com, with California and New York collected separately to keep the comparison clear and fair, using publicly available listing pages such as the California condos-for-sale section on Homes.com. Rather than relying on assumptions or headlines, this approach builds a solid data foundation straight from the source, making it easier to understand how these two iconic housing markets differ in cost, scale, and overall buying pressure. The sections that follow document this data scraping journey step by step, showing how structured code can quietly turn thousands of online listings into meaningful, ready-to-analyze housing data.
Scraping Condo Listings from Homes.com Using Playwright
Here is the step-by-step guide to scraping condo listings from Homes.com using Playwright.
STEP 1: URL Collection from Condo Listings
When working with Homes.com, the first real challenge is not extracting prices or property details, but reliably finding every individual property URL spread across many result pages. Homes.com loads listings dynamically and divides them across multiple pages, which means important links are not always visible at once and can easily be missed if the page is rushed. To solve this, a structured scraping flow was built using Python and browser automation tools to carefully move through the listings, starting from the California condos page and then repeating the same process separately for New York. The scraper opens the site like a normal visitor, waits for listings to load, scans each page for property cards, and collects only clean, valid links before moving to the next page. Small pauses are added between actions so the browsing pattern looks natural, while logging quietly records progress in the background. Instead of saving links in loose files, California condo URLs are stored in one database table and New York URLs in another, keeping both regions clearly separated and easy to manage later. This step works much like writing down the addresses of houses in two different cities before visiting them: slow, careful, and organized, ensuring that the foundation is solid before moving on to deeper property-level scraping.
STEP 2: Structured Data Extraction from Individual Property Pages
Once the list of property URLs was safely stored in the database, the next step was to visit each link one by one and gently extract the details that truly describe a condo listing. The property pages often load important information only after the page settles, so each URL was opened carefully, allowing the content to fully appear before anything was read. The scraper fetched the page, parsed the HTML, and then looked for clear, reliable markers to capture essentials such as the property image, listed price, and address, similar to slowly reading a property brochure instead of skimming it. Each extracted record was saved back into a structured database table, with California listings written to one table and New York listings to another, keeping both markets cleanly separated and easy to manage. This stage completes the journey that began with URL collection, transforming simple links into meaningful, organized housing data that is ready for comparison and deeper understanding.
STEP 3: Data Cleaning with OpenRefine
After scraping condo listings from Homes.com and storing California and New York data in separate tables, the next important step was cleaning the raw data so it could be read, compared, and analyzed with confidence. The scraped files were uploaded into OpenRefine, a tool designed to make large datasets easier to inspect and fix, where missing values were first standardized by filling empty fields with clear markers like “N/A” to avoid confusion later. Duplicate property URLs, which can quietly appear during repeated scraping or pagination, were identified and removed to ensure each condo appeared only once. Price fields often contained currency symbols and formatting that looked fine to the eye but caused issues during analysis, so these symbols were carefully removed and the values normalized into a consistent numeric format. To improve readability, some combined fields were split into separate columns, making details like location and pricing clearer at a glance. This cleaning stage acts like tidying handwritten notes into a neat spreadsheet, turning raw scraped content into a reliable and well-structured dataset that is ready for meaningful comparison between California and New York housing markets.
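For readers who prefer to see these cleaning rules as code, the short sketch below performs the same steps (filling blanks with "N/A", dropping duplicate URLs, normalizing prices, and splitting the combined address) using pandas. It is only an illustrative equivalent: the file paths are assumptions, the column names come from the scraped_data table built later in this guide, and in the actual workflow these operations were done interactively in OpenRefine.
import sqlite3
import pandas as pd

# Illustrative sketch only: the real cleaning was done interactively in OpenRefine.
# The table and column names mirror the scraper below; the file paths are assumptions.
conn = sqlite3.connect("homes_urls_california.db")
df = pd.read_sql_query("SELECT url, image_url, price, address FROM scraped_data", conn)

# 1. Standardize missing values with a clear marker.
df = df.fillna("N/A")

# 2. Remove duplicate property URLs that can slip in across paginated runs.
df = df.drop_duplicates(subset="url")

# 3. Strip currency symbols and commas, then normalize prices to numbers.
df["price_usd"] = pd.to_numeric(
    df["price"].str.replace(r"[^0-9.]", "", regex=True), errors="coerce"
)

# 4. Split the combined address into separate, easier-to-read columns.
df[["street", "city_state_zip"]] = df["address"].str.split(",", n=1, expand=True)

df.to_csv("homes_california_clean.csv", index=False)
conn.close()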
Advanced Tool-sets and Libraries for Efficient Data Extraction
This project brings together a carefully chosen set of Python libraries that make the process of scraping condo listings from Homes.com clear and reliable, even when California and New York data are handled separately. Playwright, accessed through playwright.sync_api, is responsible for opening real browser pages and navigating listings the same way a human would, while playwright_stealth quietly adjusts browser signals so the activity looks natural rather than automated. Alongside browser automation, requests is used for fetching individual pages through an API-based approach when needed, and BeautifulSoup helps gently read the HTML and pull out meaningful details like prices, images, and addresses, similar to carefully scanning a printed brochure for key information. Data handling is kept simple and lightweight using sqlite3 for storing URLs and tracking progress locally, and json for exporting clean, shareable results. Supporting libraries such as time and random introduce small pauses and variation between actions to keep browsing behavior realistic, while logging acts like a detailed activity journal that records what happens at each step, making errors easier to understand and fix. Utilities like Path from pathlib help manage files and folders cleanly, and List from typing improves code clarity without adding complexity. Together, these tools form a smooth workflow where browsing, data extraction, storage, and monitoring flow naturally, making the entire scraping process approachable and easy to follow for interns or beginners exploring real-world web data collection for the first time.
STEP 1: Extracting Condo Listing URLs from Homes.com
Importing Libraries
Imports
import time
import random
import json
import sqlite3
import logging
from pathlib import Path
from typing import List
from playwright.sync_api import sync_playwright, Page
from playwright_stealth import stealth_sync
The code begins by importing a small set of essential Python libraries that work together to make the scraping process smooth and reliable. Playwright is used to open and interact with dynamic pages on Homes.com, playwright_stealth helps reduce bot detection, while standard libraries like json, sqlite3, and logging handle data storage, structure, and execution tracking so the California and New York condo data can be collected and saved in an organized way.
Setting the Starting Point and Storage Paths
Configuration
START_URL = "https://www.homes.com/california/condos-for-sale/"
BASE_URL = "https://www.homes.com"
USER_AGENTS_FILE = "/home/anusha/Desktop/DATAHUT/chewy-and-petco/user_agents.txt"
SQLITE_DB = "/home/anusha/Desktop/DATAHUT/condos/Data/homes_urls_california.db"
JSON_OUT = "/home/anusha/Desktop/DATAHUT/condos/Data/homes_urls_california.json"
LOG_FILE = "/home/anusha/Desktop/DATAHUT/condos/Log/homes_scraper_california.log"
This configuration section defines the foundation for scraping condo listings from Homes.com in a clear and controlled way, starting with the California condos page using the START_URL, while the BASE_URL helps convert relative links into complete, usable URLs as the scraper moves through multiple pages. To reduce blocking and appear more like a real browser, a list of rotating user agents is loaded from an external file, a common and beginner-friendly practice covered in many web scraping guides. The scraped condo URLs are stored safely in both a SQLite database and a JSON file, making the data easy to reuse later for analysis or sharing, while a dedicated log file records each step of the process so errors or interruptions can be traced and fixed without guesswork. By keeping California and New York data in separate configuration setups, the scraping process remains organized, scalable, and easy to understand, much like keeping files in clearly labeled folders rather than mixing everything together.
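Since the New York listings were collected with the same script, only this configuration block changes between runs. The values below are an illustrative sketch rather than a copy of the original project: the New York listing path and the local file paths are assumptions.
# Hypothetical New York configuration: same scraper, different start URL and output files.
START_URL = "https://www.homes.com/new-york/condos-for-sale/"  # assumed NY condos path
BASE_URL = "https://www.homes.com"
SQLITE_DB = "Data/homes_urls_new_york.db"    # illustrative local paths
JSON_OUT = "Data/homes_urls_new_york.json"
LOG_FILE = "Log/homes_scraper_new_york.log"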
Using Browser-Like Headers to Avoid Blocking
Headers to mimic a real browser
EXTRA_HEADERS = {
    "Content-Type": "text/html; charset=utf-8",
    "x-frame-options": "SAMEORIGIN",
    "x-content-type-options": "nosniff",
    "Content-Security-Policy": "frame-ancestors 'self' https://*.homes.com https://auth.homes.com;",
    "Permissions-Policy": "browsing-topics=()",
}
This section adds extra request headers that help the scraper behave more like a real web browser when visiting Homes.com, which is especially important when collecting condo listings separately from California and New York. Headers such as Content-Type and security-related fields quietly tell the website how the page should be handled, reducing the chances of the request being flagged as automated and improving page stability during loading. Using headers in this way is a widely accepted best practice in responsible web scraping. By keeping these settings simple and consistent, the scraper can access listing pages more smoothly while staying aligned with standard web behavior.
Managing Cookies for Stable Page Access
Cookies
COOKIES = [
    {"name": "sb", "value": "%255b%257b%2522pt%2522%253a4%252c%2522gk%2522%253a%257b%2522key%2522%253a%25227969zft86ltlw%2522%257d%252c%2522gt%2522%253a1%257d%255d",
     "domain": "www.homes.com", "path": "/", "httpOnly": True, "secure": True},
    {"name": "ls", "value": "%2Fcalifornia%2Fcondos-for-sale%2F", "domain": "www.homes.com", "path": "/", "httpOnly": True, "secure": True},
]
This cookies section helps the scraper maintain a steady and predictable browsing session while visiting condo listings on Homes.com, whether the data comes from California or New York, by storing small pieces of information that the website normally saves in a real user’s browser. In simple terms, cookies act like a bookmark and preference note combined, allowing the site to remember visited paths and session details so pages load correctly without repeated interruptions or redirects. By predefining these cookies, the scraper avoids unnecessary reloads and behaves more like a regular visitor calmly browsing listings, rather than repeatedly knocking on the door as a new guest each time.
Adding Delays for Human-Like Browsing
Delay between page loads
DELAY_MIN = 1.0
DELAY_MAX = 4.0
MAX_PAGES = None # Set limit if needed
This section introduces a small random delay between page loads so the scraper moves through the website at a calm, natural pace, similar to how a real person would browse condo listings in different locations. By waiting a short, varied amount of time between requests, the process becomes more stable and respectful to the website.
Logging: Keeping Track of Every Step
Logging setup
def setup_logger():
    """Sets up a logger to record all scraper activity"""
    log_dir = Path(LOG_FILE).parent
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("homes_scraper_california")
    logger.setLevel(logging.DEBUG)
    formatter = logging.Formatter("%(asctime)s [%(levelname)s] %(message)s", "%Y-%m-%d %H:%M:%S")
    file_handler = logging.FileHandler(LOG_FILE, mode="w", encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(formatter)
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    console_handler.setFormatter(formatter)
    logger.addHandler(file_handler)
    logger.addHandler(console_handler)
    return logger

logger = setup_logger()
This logging setup creates a clear activity record for the Homes.com condo scraper, making it easy to understand what happens while collecting listings from California and New York, especially when the two regions are handled separately. The logger acts like a travel journal for the scraper, writing detailed messages to a log file for later review while also showing important updates on the screen, a pattern commonly used in real-world scraping projects to quickly spot errors, interruptions, or unexpected page behavior without guessing what went wrong.
Utility Function for Loading User Agents
Utilities
def read_user_agents(path: str) -> List[str]:
    """Reads user-agent strings from a text file"""
    p = Path(path)
    if not p.exists():
        logger.error(f"User agents file not found: {path}")
        raise FileNotFoundError(f"User agents file not found: {path}")
    lines = [l.strip() for l in p.read_text(encoding="utf-8").splitlines() if l.strip()]
    logger.debug(f"Loaded {len(lines)} user agents.")
    return lines
This utility function safely reads browser user-agent strings from an external text file and prepares them for use during scraping, which helps the Homes.com condo crawler appear more like a real visitor while collecting listings separately from California and New York. By checking whether the file exists, logging useful messages, and returning only clean, non-empty entries, the function adds a small but important layer of reliability.
Initializing the SQLite Database for Storing URLs
Initialize SQLite DB and create table if not exists
def init_db(db_path: str):
    """Creates (or connects to) an SQLite database and sets up the 'urls' table"""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS urls (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT NOT NULL UNIQUE,
            inserted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
    logger.debug(f"Initialized database at {db_path}")
    return conn
This database setup function creates a simple and reliable place to store condo listing URLs, keeping data neatly organized even when California and New York are scraped separately. Using SQLite allows the scraper to save each unique URL with a timestamp, much like writing entries into a small digital notebook that automatically avoids duplicates, making the data easy to manage and revisit later without additional complexity.
Saving and Deduplicating Property URLs
Save a URL to the database, return True if new, False if duplicate
def save_url_to_db(conn: sqlite3.Connection, url: str) -> bool:
    """Inserts a property URL into the database if it’s new"""
    try:
        cur = conn.cursor()
        cur.execute("INSERT INTO urls (url) VALUES (?)", (url,))
        conn.commit()
        logger.debug(f"Inserted new URL: {url}")
        return True
    except sqlite3.IntegrityError:
        logger.debug(f"Duplicate URL skipped: {url}")
        return False
This function handles the careful task of saving each URL into the SQLite database while automatically skipping duplicates, which is especially helpful when collecting data separately for California and New York. It works like a checklist that marks a URL only once: new links are stored and logged, while repeated ones are quietly ignored, ensuring the dataset stays clean, accurate, and easy to work with as scraping progresses.
Exporting Collected URLs to a JSON File
Dump all URLs from DB to JSON file
def dump_db_to_json(conn: sqlite3.Connection, out_path: str):
    """Exports all stored URLs from the database into a JSON file"""
    cur = conn.cursor()
    cur.execute("SELECT url FROM urls ORDER BY id")
    rows = [r[0] for r in cur.fetchall()]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)
    logger.info(f"Wrote {len(rows)} unique URLs to {out_path}")
This function takes all the condo listing URLs stored in the SQLite database and neatly exports them into a JSON file, creating a clean and portable output that is easy to read, share, or use in later analysis. It acts like packing a well-organized folder from a notebook into a digital file, ensuring the final scraped data is structured and reusable.
Extracting Property Links from Each Listings Page
Scraper
def extract_property_urls(page: Page) -> List[str]:
    """Extracts all property listing URLs from the current page"""
    script = """
    () => {
        const nodes = Array.from(document.querySelectorAll('div.description-container'));
        const urls = [];
        for (const el of nodes) {
            let a = el.closest('a');
            if (!a) a = el.querySelector('a') || el.parentElement?.querySelector('a');
            if (a && a.href) urls.push(a.href);
        }
        return Array.from(new Set(urls));
    }
    """
    try:
        urls = page.evaluate(script)
    except Exception as e:
        logger.exception("evaluate failed: %s", e)
        urls = []
    normalized = [BASE_URL.rstrip("/") + u if u.startswith("/") else u for u in urls]
    return normalized
This scraper function focuses on carefully finding and collecting individual condo listing links by looking for the page sections that describe each property and then tracing them back to their clickable links. It scans the page the same way a person’s eyes would move across listing cards, gathers only valid and unique URLs, fixes any partial links using the base website address, and safely handles errors if the page structure changes.
Moving Through Pagination to Reach More Listings
Navigate to the next page if available
def click_next(page: Page, current_page: int) -> bool:
    """The function looks for the 'Next Page' button or link using its page number"""
    try:
        logger.info(f"Looking for next page link (current page {current_page})...")
        next_link = page.query_selector(f'a.text-only[title="Page {current_page + 1}"], a[data-page="{current_page + 1}"]')
        if not next_link:
            logger.info("No next page link found.")
            return False
        href = next_link.get_attribute("href")
        if not href:
            logger.info("Next link has no href.")
            return False
        next_url = BASE_URL.rstrip("/") + href
        logger.info(f"Navigating to next page: {next_url}")
        page.goto(next_url, timeout=60000)
        page.wait_for_selector("div.description-container", timeout=15000)
        return True
    except Exception as e:
        logger.warning(f"Failed to navigate next page: {e}")
        return False
This pagination function helps the scraper move smoothly from one results page to the next by looking for the link that points to the following page number and navigating to it when available, which is essential when collecting condo listings spread across multiple pages. It works like turning the next page of a catalog—checking that the page exists, opening it safely, and waiting until the listings load—while logging each step and handling errors gracefully.
Starting the Scraper and Preparing the Browser
Main Scraper
def run_scraper():
    """The main function that runs the entire scraping process from start to finish"""
    user_agents = read_user_agents(USER_AGENTS_FILE)
    conn = init_db(SQLITE_DB)
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False, args=["--start-maximized"])
        chosen_ua = random.choice(user_agents)
        logger.info(f"Using User-Agent: {chosen_ua}")
        context = browser.new_context(
            user_agent=chosen_ua,
            viewport={"width": 1366, "height": 768},
            extra_http_headers=EXTRA_HEADERS,
        )
This first part sets the stage for the entire scraping process by loading the list of user agents, connecting to the SQLite database, and launching a real browser using Playwright. A random user agent is selected to make each run look like a normal browsing session, similar to how different people use different devices or browsers, which helps reduce blocking. The browser context is then created with a standard screen size and custom headers so the Homes.com pages load as expected, forming a stable base before visiting the condo listings.
        try:
            context.add_cookies([{**c, "url": BASE_URL} for c in COOKIES])
        except Exception:
            logger.warning("Failed to add cookies; continuing without them.")
        page = context.new_page()
        stealth_sync(page)
        logger.info(f"Opening start URL: {START_URL}")
        page.goto(START_URL, timeout=60000)
        time.sleep(random.uniform(DELAY_MIN, DELAY_MAX))
This section focuses on opening the Homes.com listing page in a calm and realistic way by first adding cookies, if possible, and then enabling stealth mode to reduce automated detection. Cookies help the site remember session details, while stealth settings adjust small browser signals so the page behaves as if a real person is browsing. After opening the California condos start URL, the scraper pauses briefly using a random delay, much like a person taking a moment to read the page before scrolling.
        page_num, total_new = 1, 0
        while True:
            logger.info(f"Scraping page {page_num}")
            try:
                page.wait_for_selector("div.description-container", timeout=15000)
            except Exception:
                logger.warning("Listings not detected quickly; proceeding anyway.")
            urls = extract_property_urls(page)
            new_count = sum(save_url_to_db(conn, u) for u in urls)
            total_new += new_count
            logger.info(f"Page {page_num}: Found {len(urls)} URLs ({new_count} new)")
Once the page is loaded, this part handles the core task of finding condo links on each results page and saving them safely into the database. The scraper waits for listing containers to appear, extracts all property URLs, and stores only new ones while skipping duplicates, keeping the dataset clean and organized. Each page is logged with the number of links found, which helps track progress clearly when scraping large regions like California or New York.
            if MAX_PAGES and page_num >= MAX_PAGES:
                logger.info("Reached max pages limit. Stopping.")
                break
            if not click_next(page, page_num):
                logger.info("No more pages. Scraping complete.")
                break
            time.sleep(random.uniform(DELAY_MIN, DELAY_MAX))
            page.wait_for_load_state("networkidle", timeout=20000)
            time.sleep(random.uniform(1, 2))
            page_num += 1
        dump_db_to_json(conn, JSON_OUT)
        logger.info(f"Scraping finished: {page_num} pages, {total_new} new URLs collected.")
        context.close()
        browser.close()
        conn.close()
The final part controls how the scraper moves across multiple pages and knows when to stop, either by reaching a page limit or when no next page is available. After each successful page turn, the scraper waits for the network to settle and adds short pauses to keep the browsing pattern natural and stable. When all pages are completed, the collected URLs are exported into a JSON file, the browser and database connections are closed properly, and a final summary is written to the logs.
Script Entry Point and Safe Execution
Entry point
if __name__ == "__main__":
    """Runs the scraper when this script is executed directly"""
    try:
        run_scraper()
    except Exception as e:
        logger.exception(f"Fatal error: {e}")
This entry point ensures the scraper runs only when the file is executed directly, acting like a clear “start here” sign for the program while keeping the code safe if it is imported elsewhere. By wrapping the main scraping logic in a try–except block, any unexpected issue during the Homes.com condo collection for California or New York is captured and written to the logs instead of stopping silently, making the overall workflow easier to understand and more reliable.
Step 2: Extracting Complete Property Details from Individual Homes.com Listings
Importing Libraries
Imports
import sqlite3
import requests
import random
import time
import logging
from bs4 import BeautifulSoup
import json
Defining Paths for Data, Logs, and Browser Identity
Configurations
DB_PATH = "/home/anusha/Desktop/DATAHUT/condos/Data/homes_urls_california.db"
USER_AGENTS_FILE = "/home/anusha/Desktop/DATAHUT/chewy-and-petco/user_agents.txt"
LOG_FILE = "/home/anusha/Desktop/DATAHUT/condos/Log/homes_scraper.log"
This configuration block sets clear file paths that tell the scraper where to store collected condo URLs, where to read browser user-agent details, and where to write log messages while visiting Homes.com listings for California or New York. It is like assigning fixed shelves for notes, tools, and progress records.
Managing the Scraper API Key Securely
SCRAPERAPI_KEY = "********************686"
This configuration line represents the API key used to route requests through a scraping service, which helps access Homes.com listings more reliably when collecting data. This key can be thought of as an access pass that allows the scraper to use an external helper service, and best practice is to keep it private and load it from environment variables instead of writing it directly in the code.
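As a quick illustration of that best practice, here is a minimal sketch that reads the key from an environment variable rather than hard-coding it; the variable name SCRAPERAPI_KEY used here is an assumption, not taken from the original project.
import os

# Hypothetical: read the ScraperAPI key from the environment
# (for example, after running `export SCRAPERAPI_KEY="your-key"` in the shell)
# and stop early with a clear error if it is missing.
SCRAPERAPI_KEY = os.environ.get("SCRAPERAPI_KEY")
if not SCRAPERAPI_KEY:
    raise RuntimeError("SCRAPERAPI_KEY environment variable is not set")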
Basic Logging Configuration for Clear Progress Tracking
Logging setup
logging.basicConfig(
    filename=LOG_FILE,
    level=logging.INFO,
    format='%(asctime)s [%(levelname)s] %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)
# This section sets up all the basic configurations needed for the scraper to run
This logging setup defines how the scraper records its activity while collecting condo listings, by saving clear, time-stamped messages into a log file. In simple terms, it works like a running diary that notes what happened and when, making it much easier to understand the scraper’s behavior or diagnose issues later.
Using Browser-Like Headers for Natural Page Requests
Headers to mimic a real browser
EXTRA_HEADERS = {
    "User-Agent": "",  # will be overwritten
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5"
}
Cookies for Maintaining a Stable Browsing Session
Cookies
COOKIES = [
    {"name": "sb", "value": "%255b%257b%2522pt%2522%253a4%252c%2522gk%2522%253a%257b%2522key%2522%253a%25227969zft86ltlw%2522%257d%252c%2522gt%2522%253a1%257d%255d",
     "domain": "www.homes.com", "path": "/", "httpOnly": True, "secure": True},
    {"name": "ls", "value": "%2Fcalifornia%2Fcondos-for-sale%2F", "domain": "www.homes.com", "path": "/", "httpOnly": True, "secure": True},
]
Selecting a Random User-Agent for Natural Browsing
Helper function to get a random User-Agent
def get_random_user_agent():
    with open(USER_AGENTS_FILE, 'r') as f:
        user_agents = [line.strip() for line in f if line.strip()]
    return random.choice(user_agents)
This helper function simply reads a list of browser user-agent strings from a text file and randomly picks one each time the scraper runs, helping the scraper appear like a regular visitor rather than an automated script when accessing California or New York listings. In easy terms, it is similar to changing the type of browser being used on each visit, which helps pages load more smoothly and reduces unnecessary blocking.
Handling Requests Safely with Retries and Smart Delays
Robust request function with retries and backoff
def make_request(url, retries=2, timeout=60):
    for attempt in range(1, retries + 1):
        try:
            user_agent = get_random_user_agent()
            headers = EXTRA_HEADERS.copy()
            headers['User-Agent'] = user_agent
            params = {
                'api_key': SCRAPERAPI_KEY,
                'url': url,
                'country_code': 'us',
                'device_type': 'desktop'
            }
            response = requests.get("https://api.scraperapi.com/", params=params, headers=headers, timeout=timeout)
            if response.status_code == 200:
                return response.text
            else:
                logging.warning(f"Request failed with status {response.status_code} for {url}. Attempt {attempt}/{retries}")
        except requests.exceptions.RequestException as e:
            logging.warning(f"Request exception for {url} (Attempt {attempt}/{retries}): {e}")
        # Exponential backoff with jitter
        sleep_time = min(10, 2 ** attempt + random.random())
        time.sleep(sleep_time)
    logging.error(f"All retries failed for {url}")
    return None
This function is designed to fetch pages in a steady and reliable way, even when the network is slow or the site temporarily refuses a request, which can happen when collecting condo data separately for large regions. Each request is sent with a randomly chosen browser identity and helpful headers, routed through a scraping service, and if something goes wrong, the function patiently retries after waiting for a gradually increasing amount of time, similar to taking short breaks before trying again rather than rushing. This retry-and-wait approach, often called backoff, helps avoid unnecessary failures, making the scraping process calmer, more resilient, and easier to understand for those new to web automation.
Parsing Individual Property Pages into Structured Data
Parse property page HTML to extract data
def parse_property_page(html, url):
    soup = BeautifulSoup(html, 'html.parser')
    data = {
        "url": url,
        "image_url": None,
        "price": None,
        "address": None
    }
    img_tag = soup.select_one('figure#carousel-primary-photo img')
    if img_tag:
        data['image_url'] = img_tag['src']
    price_tag = soup.select_one('span.property-info-price')
    if price_tag:
        data['price'] = price_tag.get_text(strip=True)
    address_tag = soup.select_one('div.property-info-address')
    if address_tag:
        data['address'] = ' '.join(address_tag.stripped_strings)
    return data
This parsing function takes the raw HTML of a single Homes.com property page and gently turns it into clean, readable information by locating key elements such as the main image, listed price, and property address. In simple terms, it works like carefully reading a flyer and picking out only the important details, using clear HTML markers to avoid confusion if the page layout changes slightly. By returning the data in a simple dictionary, the scraper keeps California and New York condo details well organized and ready for further use or analysis without adding unnecessary complexity.
Creating a Table to Store Scraped Property Details
Database setup
def create_table_if_not_exists(conn):
    conn.execute('''
        CREATE TABLE IF NOT EXISTS scraped_data (
            url TEXT PRIMARY KEY,
            image_url TEXT,
            price TEXT,
            address TEXT
        )
    ''')
    conn.commit()
This database setup function prepares a simple table where each scraped Homes.com property can be saved in an organized and reliable way, whether the data comes from California or New York. In everyday terms, it creates a structured notebook with fixed columns for the property link, image, price, and address, and uses the URL as a unique key so the same listing is not stored twice. By setting up the table only if it does not already exist, the scraper can be run multiple times without breaking earlier data, keeping the overall workflow clean and easy to understand.
Tracking Progress with a processed Column
Add 'processed' column if not exists
def add_processed_column_if_not_exists(conn):
    cur = conn.cursor()
    cur.execute("PRAGMA table_info(urls)")
    columns = [col[1] for col in cur.fetchall()]
    if 'processed' not in columns:
        cur.execute("ALTER TABLE urls ADD COLUMN processed INTEGER DEFAULT 0")
        conn.commit()
This small helper function prepares the database to track scraping progress by checking whether a processed column already exists in the urls table and adding it only if it is missing. In simple terms, this column works like a checklist mark, showing whether a condo listing from Homes.com has already been handled or still needs attention, which is especially useful when California and New York data are scraped separately or when a long run stops midway.
Main Loop for Managing Unprocessed URLs
Main loop
def main():
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
    add_processed_column_if_not_exists(conn)
    create_table_if_not_exists(conn)
    cursor.execute("SELECT url FROM urls WHERE processed=0")
    urls = [row[0] for row in cursor.fetchall()]
    logging.info(f"Found {len(urls)} unprocessed URLs")
    for idx, url in enumerate(urls, 1):
        logging.info(f"Scraping ({idx}/{len(urls)}): {url}")
        html = make_request(url)
        if not html:
            logging.warning(f"Failed to get HTML for {url}")
            continue
This main loop acts as the control center of the scraper, opening the database, preparing required tables, and identifying which Homes.com condo URLs still need to be processed, a step that is especially useful when California and New York listings are handled separately. In simple terms, it reads a to-do list from the database, logs how many property pages are pending, and then visits each URL one by one, carefully requesting the page content and skipping over any link that fails to load. This structured flow makes long scraping runs easier to pause and resume.
        try:
            data = parse_property_page(html, url)
            cursor.execute("""
                INSERT OR IGNORE INTO scraped_data (url, image_url, price, address)
                VALUES (?, ?, ?, ?)
            """, (
                data["url"],
                data["image_url"],
                data["price"],
                data["address"]
            ))
            cursor.execute(
                "UPDATE urls SET processed=1 WHERE url=?",
                (url,)
            )
            conn.commit()
        except Exception as e:
            logging.error(f"Error parsing or inserting data for {url}: {e}")
        time.sleep(random.uniform(1, 5))
    conn.close()
    logging.info("Scraping completed.")
Once a property page is successfully loaded, this part focuses on turning the page into useful data and storing it safely, while clearly marking progress in the database. The scraper extracts key details like the image, price, and address, saves them into a separate table, and then updates the original URL record to show that it has been processed, much like ticking off a completed task on a checklist. Short pauses between each request keep the browsing pattern natural and stable, and any errors are logged without stopping the entire run.
Entry Point: Starting the Scraper Safely
Entry point
if __name__ == "__main__":
    main()
This entry point tells Python to start the scraping process only when the script is run directly, acting like a clear “start button” for the program. By calling the main() function here, the code stays organized and avoids running unintentionally if the file is reused elsewhere.
Conclusion
By the end of this workflow, the entire scraping journey feels less like a collection of isolated code blocks and more like a smooth, well-guided process. Each part—starting from discovering listing pages, moving through individual property details, and finally storing clean, usable data—fits together naturally, much like following a clear route on a map rather than guessing directions along the way. The careful use of delays, logging, progress tracking, and simple storage ensures that the scraper remains steady even when handling large regions like California and New York separately. For anyone stepping into real-world data collection for the first time, this approach offers a reassuring reminder that with the right structure and patience, complex-looking tasks can be broken down into manageable, understandable steps that quietly do their job in the background.
Libraries and Versions Used
Name: random
Version: Built-in Python module
Name: sqlite3
Version: Built-in Python module
Name: json
Version: Built-in Python module
Name: logging
Version: Built-in Python module
Name: BeautifulSoup (bs4)
Version: 4.12.3
Name: playwright
Version: 1.48.0
Name: playwright-stealth
Version: 1.0.6
AUTHOR
I’m Anusha P O, a Data Science Intern at Datahut, with hands-on experience in building reliable and scalable web-scraping workflows. In this blog, the focus is on extracting structured condo listing data from Homes.com, where California and New York listings were scraped separately and organized into clean, usable datasets. The walkthrough covers how dynamic listing pages were navigated using Playwright, how property URLs and details were stored safely in databases, and how raw web content was cleaned and refined into analysis-ready data using tools like SQLite, JSON, and OpenRefine.
At Datahut, the work centers on helping businesses turn public web data into meaningful intelligence for market research, pricing analysis, and location-based insights. If there is interest in understanding real-estate trends, building dependable data pipelines, or working with large, unstructured web datasets, feel free to connect through the chat widget. Raw listings become far more valuable when they are transformed into clear, structured insights that support confident decision-making.
Frequently Asked Questions (FAQs)
1. Why use Playwright for scraping condo listings from Homes.com?
Playwright is ideal for scraping modern real estate websites because it can handle JavaScript-heavy pages, dynamic content loading, pagination, and user interactions like filters and scrolling. This ensures accurate extraction of listing details that may not appear in the initial HTML.
2. What type of data can be extracted from Homes.com condo listings?
You can extract property titles, prices, locations, number of bedrooms and bathrooms, square footage, listing agents, property descriptions, images, and availability status. This data is useful for market analysis, price benchmarking, and real estate trend tracking.
3. How do you handle pagination and dynamic loading on Homes.com?
Pagination and dynamic loading can be handled in Playwright by simulating user actions such as clicking “Next,” scrolling to trigger lazy loading, and waiting for network requests or DOM elements to load before extracting data.
4. Is it legal to scrape real estate listing websites like Homes.com?
Scraping is generally allowed when done responsibly, respecting the website’s terms of service, robots.txt guidelines, and applicable laws. It’s important to avoid excessive requests, bypassing authentication, or collecting personal data improperly.
5. What are the common challenges when scraping real estate websites?
Common challenges include anti-bot protections, frequently changing page structures, dynamic rendering, rate limits, and image-heavy pages. Using proper request throttling, resilient selectors, and automated monitoring helps maintain stable scraping pipelines.