How to Automate Trulia Real Estate Data Scraping with Python
- Ambily Biju
- Jun 9
- 38 min read

Introduction
Did you know that most home buyers start by searching online? With Trulia, users can browse property listings, compare prices, and analyze neighborhood trends before making a decision. However, manually tracking real estate data across multiple listings is time-consuming and inefficient.
This tutorial walks you through automatically scraping Trulia's real estate listings. You'll learn how to use Python and standard scraping techniques to pull valuable property information such as price, location, amenities, and mortgage rates directly from Trulia's listings.
Whether you're a property investor, data analyst, or market researcher, this guide will help you collect structured data at scale, enabling better decision-making and deeper market insights. So let's dive into how this two-step automated web scraping process works!
What is Web Scraping?
Web scraping is the automated extraction of data from websites using code. A script sends requests to web pages, retrieves their content, and pulls out the details it needs so they can be reused elsewhere. In the case of Trulia, structured real estate data such as price, location, amenities, and mortgage rates can be aggregated without manual effort. That makes web scraping an essential tool for investors, analysts, and researchers who need up-to-date market information and an early read on new trends.
Brief Overview of the Trulia Web Scraping Process
Web scraping lets us extract property-related data directly from Trulia's network APIs and HTML content. This project performs a two-step automated process to collect detailed information about homes listed on Trulia in San Francisco. The first script collects unique product links: it builds URLs from a base API endpoint and a list of transaction IDs, sends POST requests, recursively parses the JSON responses to find listing links, and stores them in an SQLite database. The second script takes those links and scrapes detailed property information, sending GET requests with a randomized user agent and proxy support to mask the IP address. It then parses each property page with BeautifulSoup to extract the home name, location, price, mortgage details, specifications, description, highlights, amenities, and tax data. The extracted data is stored in a structured SQLite database, and failed requests are logged for review. The result is a robust, end-to-end pipeline for gathering data ready for further analysis.
Libraries Behind the Scenes in Trulia Web Scraping
A set of Python libraries serves as the backbone of this project, ensuring smoother execution and robust performance. Let's explore the libraries used, their purposes, and how they contribute to this project.
Requests
Requests handles all web communication in this project, making HTTP requests such as GET and POST to Trulia's servers straightforward. In the first phase, requests sends POST requests to Trulia's GraphQL API to fetch JSON data containing product links. In the second phase, it retrieves the HTML content of each property page. Its simplicity and reliability make it indispensable for web interactions.
SQLite3
The sqlite3 library provides the project with a lightweight yet capable database solution, keeping data organized and persistent across both phases of scraping. In the first phase, it creates and manages the database that stores unique product links along with their scraping status, which prevents duplicates and keeps the scraping process efficient. In the second phase, it stores detailed property information such as home names, locations, and prices, logs failed attempts, and enables retries for incomplete or unsuccessful scraping operations.
BeautifulSoup
BeautifulSoup, imported from bs4, is the core of the second stage. It parses the raw HTML of each property page into a navigable structure so that specific pieces of information, such as names, prices, locations, and amenities, can be extracted reliably. The project depends on BeautifulSoup's parsing power to turn unstructured web data into structured, usable data.
Random and Time
To keep the scraping natural-looking and resistant to bot detection, the random and time libraries are used together. Each request picks a random user-agent string from a text file, simulating many different browsers and versions. The random library also varies the length of the delay between requests, which time.sleep then applies. This deliberate slowdown keeps the project from being noticed and throttled by the web server.
urllib3
The urllib3 library extends the HTTP capabilities of the project. Specifically, it handles secure HTTPS connections and disables SSL warnings when making connections over proxy servers. This capability is critical to maintaining uninterrupted communication with Trulia's servers, particularly when using a proxy to anonymize requests.
Understanding Proxies and Their Role in Web Scraping
A proxy is an intermediary server that sits between the web scraping script and the target server. When a request is relayed through a proxy, the target server never sees the client's original IP address; it sees the proxy's IP. Masking the original IP is essential in web scraping to prevent IP blocking and to bypass geographical restrictions. Proxies distribute requests across multiple IPs to avoid detection and allow access to region-specific content, ensuring seamless and unrestricted data collection.
This project uses proxies mainly to avoid IP blocking. Most sites apply anti-scraping tactics such as rate limiting or outright IP blocking once they see frequent, repetitive requests coming from the same IP address. By routing requests through a proxy, or better yet a pool of proxies, a scraping script spreads its traffic across different IP addresses, reducing the risk of being flagged or blocked as a scraper.
Beyond preventing IP blocks, proxies help bypass geographical restrictions. A website may limit content based on a user's location, and proxies in other regions make the scraping script appear to come from a variety of places, which helps with comprehensive data gathering and evades such restrictions. Proxies also contribute to anonymity, masking the client's actual identity and location and keeping it off the radar of tracking systems that flag suspicious activity. Finally, proxies make scraping more reliable and scalable: if one proxy gets flagged or blocked, the scraper can switch to another so that data extraction isn't interrupted.
In this project, proxies are integrated directly into the request mechanism: the proxy settings are passed along with every request, and urllib3 is used to suppress SSL warnings on those HTTPS connections. Every request therefore travels through the proxy server, providing anonymity, accessibility, and traffic distribution for a robust scraping process. Proxies are less a luxury than a necessity in any web scraping project, especially one that targets a site with serious anti-scraping measures. Users can either choose Datahut's proxy services or opt for any other free or paid proxy service based on their preferences and requirements.
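As a minimal, hedged sketch of the idea before we get to the project code, here is how a requests call can be routed through an authenticated proxy. The hostname, username, and password are placeholders you would replace with your own provider's details:
import requests

# Placeholder credentials: substitute the details from your own proxy provider
proxies = {
    "http": "http://username:password@proxy.example.com:8001",
    "https": "http://username:password@proxy.example.com:8001",
}

# The target server sees the proxy's IP address, not the client's
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30)
print(response.json())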
STEP 1: Scraping Product URLs
Importing Required Libraries
import requests
import sqlite3
The code imports the requests library for sending HTTP requests to retrieve web content and the sqlite3 library for interacting with an SQLite database to store and manage scraped data.
SQLite Database Configuration
# SQLite database configuration
DB_NAME = "Trulia_Webscraping.db"
TABLE_NAME = "product_links"
This section defines the configuration for the SQLite database used in the project. The DB_NAME variable specifies the name of the database file, which serves as the storage location for the scraped data. The TABLE_NAME variable sets the name of the table within the database; this table stores the product links. Defining these constants ensures consistency and simplifies database operations such as table creation, data insertion, and querying.
URL and Transaction IDs Configuration
# Constant part of the URL
BASE_URL="https://www.trulia.com/graphql?operation_name=WEB_searchResultsMapQuery&transactionId="
# List of transaction IDs
TRANSACTION_IDS = [
"30a21e3f-5b7e-49b6-8deb-01f0e5f8cb07","bf8443bc-991e-4e80-85bb-fd47c1d3e615",
"d441c1be-0201-487e-b5cc-79a93795d9c4","5b006fc1-0a85-49ef-a3ae-985ec3c320ae",
"1f4b143a-71f2-45d7-b7ec-dd78dcc69cd1","82d9baa6-ca36-444b-b97f-ac6fa65ee79f",
"794ae677-5a6d-4e44-8dc0-7146a6ef574e","fd7eb450-99f2-49c4-8bbf-7ca2180d5214",
"fddcc9ce-cd47-4b9b-9d70-1cc1a1886a6b","fd7e7e4a-cee9-4205-b685-01f66b593c60",
"32db8852-62df-49c3-942c-46c1526b049c","ac86cff6-2d9a-4502-9588-a179ec9a5fbd",
"8033dce0-b699-47d4-abcd-8be1b85741c3","cd904320-f180-43b2-95b7-6cdc864b9ab1",
"0ff9c18e-6e09-4658-838c-571fcf9a5971","bb34872c-2a95-4806-8c91-c18df7228497",
"eae572f1-d55e-440f-88ca-b3f97a0e1732","7f325934-9ca3-4743-b941-4c003b25abbe",
"fead1864-4f52-4008-8f10-3de9e1220567","ea8693cf-8b20-4fca-87d4-664372db7b11",
"f48b7f88-a9df-4399-af9e-31bbedbf85ce","7856e515-3c00-4145-9f86-8554c53370cc",
"e92d4d95-f803-4787-a26f-2475b134d0fb","3ea3f7df-4c91-4ad3-be01-ac5356de0454",
"d6aaaf47-b5fb-4c87-b447-ebbcea97f349"
]
This block provides the pieces needed to form requests to Trulia. BASE_URL is the constant part of each network API URL, and TRANSACTION_IDS are unique IDs that are appended to the base URL to give each request its full form. By combining BASE_URL with these transaction IDs, the program can build a list of URLs to fetch data from, as the small sketch below illustrates. This approach manages many requests cleanly while keeping the code clear and easy to expand.
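To make the URL construction concrete, here is a short illustrative snippet; the script itself builds each URL one at a time inside send_request, shown later:
# Each full request URL is simply the constant endpoint plus one transaction ID
urls = [BASE_URL + transaction_id for transaction_id in TRANSACTION_IDS]
print(len(urls))   # 25 URLs, one per transaction ID
print(urls[0])     # ...transactionId=30a21e3f-5b7e-49b6-8deb-01f0e5f8cb07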
Setting Up Proxy Configuration
PROXIES = {
"http": "http://datahutapi:password@proxy-server.datahutapi.com:8001",
"https": "http://datahutapi:password@proxy-server.datahutapi.com:8001"
}
This defines a dictionary called PROXIES, which holds the configuration for routing traffic through a proxy server. It has two keys, "http" and "https", for the HTTP and HTTPS protocols respectively. The value for each is a URL containing the proxy server's address along with the username and password used for authentication. Here the proxy server is proxy-server.datahutapi.com running on port 8001. Any request made over these protocols now passes through that proxy server, which hides the client's IP address and helps bypass various restrictions.
Defining HTTP Headers
HEADERS = {
'accept': '*/*',
'accept-language': 'en-IN,en-GB;q=0.9,en-US;q=0.8,en;q=0.7,ml;q=0.6',
'cache-control': 'no-cache',
'content-type': 'application/json',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
}
The HEADERS dictionary defines the HTTP headers sent with each request. These matter in web scraping because they make requests look like they come from a real web browser, so the server treats them as legitimate rather than as automated traffic. The User-Agent header identifies the request as coming from a specific browser, which helps keep the server from rejecting it as suspicious or unknown. The accept and accept-language headers specify the kinds of content and the preferred language expected in the response, which makes the scraper more adaptable to the site's structure and localization. Caching is disabled, and the content type is set to JSON, which is standard for API-based data fetching. Together these headers help the scraper avoid detection and receive content in a structured, readable format like JSON.
Function to Create SQLite Database and Table
def create_database(db_name, table_name):
"""
Create an SQLite database and a table if it doesn't exist.
This function initializes a database connection and creates a table with
the specified name. The table includes the following columns:
- `product_link`: A unique identifier for each product (primary key).
- `status`: An integer field to track the scraping status of the link
(default value is 0, indicating not scraped).
Parameters:
db_name (str): The name of the SQLite database file.
table_name (str): The name of the table to be created.
Returns:
None
"""
conn = sqlite3.connect(db_name)
cursor = conn.cursor()
cursor.execute(f"""
CREATE TABLE IF NOT EXISTS {table_name} (
product_link TEXT PRIMARY KEY,
status INTEGER DEFAULT 0
)
""")
conn.commit()
conn.close()
The create_database function initializes an SQLite database and creates a table if it does not already exist. It first opens a connection to the specified SQLite database and then defines a table with the given name. The table contains two columns: product_link, which uniquely identifies each product and serves as the primary key, and status, an integer field with a default value of 0 indicating the link has not yet been scraped. Because the SQL uses CREATE TABLE IF NOT EXISTS, the table is only created when it is missing, preventing redundant table creation. After executing the SQL, the function commits the changes and closes the connection. In a web scraping project, this table makes it easy to maintain and track the status of every collected product link.
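A quick, optional sanity check (using the constants defined above) confirms the table was created:
import sqlite3

create_database(DB_NAME, TABLE_NAME)

# Look the table up in SQLite's catalogue of schema objects
conn = sqlite3.connect(DB_NAME)
row = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
    (TABLE_NAME,)
).fetchone()
conn.close()
print(row)  # ('product_links',)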
Function to Send POST Request with Transaction ID
def send_request(transaction_id):
"""
Send a POST request to the specified URL with transaction ID.
This function sends a POST request to the Trulia website's GraphQL endpoint
with the provided transaction ID . It handles the request using
the specified headers, proxies, and disables SSL verification. If an error
occurs during the request, it logs the error and returns None.
Parameters:
transaction_id (str): The unique identifier used in the URL to fetch
specific data from the server.
Returns:
response (Response or None): The response object if the request is successful,
or None if an error occurs during the request.
"""
url = BASE_URL + transaction_id
try:
response = requests.post(
url,
headers=HEADERS,
proxies=PROXIES,
verify=False
)
return response
except Exception as e:
print(f"Error sending request for \
transaction ID {transaction_id}: {e}")
return None
The send_request function sends a POST request for a given transaction ID. It builds the full URL by appending the transaction ID to BASE_URL, then calls requests.post with the pre-defined headers (HEADERS) and proxy settings (PROXIES), using verify=False to disable SSL verification so the request proceeds even if the certificate chain can't be validated through the proxy. If the request succeeds, the function returns the response object. This pattern is useful for web scraping and APIs where data has to be fetched dynamically using identifiers such as transaction IDs. If an error occurs, the exception is caught, an error message including the transaction_id is printed, and the function returns None.
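A quick way to try the function out before looping over every ID; the actual results depend on Trulia's API and on your proxy being reachable:
# Send one request and inspect the outcome
response = send_request(TRANSACTION_IDS[0])
if response is not None:
    print(response.status_code)                   # 200 on success
    print(response.headers.get("content-type"))   # typically application/json
else:
    print("Request failed; check the proxy settings and network connection")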
Extracting Product Links from Trulia's API Response
def extract_links_from_json(data):
"""
Extract unique product links recursively from JSON data.
This function recursively traverses the given JSON data, which can be in the
form of nested dictionaries or lists, to find all product links that start
with "/home/". These links are then formatted with the base URL and added to
a set to ensure uniqueness.
Parameters:
data (dict or list): The JSON data structure to be searched, which could
contain nested dictionaries or lists.
Returns:
set: A set of unique product links, each starting with "/home/".
"""
links = set()
def recursive_search(obj):
if isinstance(obj, dict):
for value in obj.values():
recursive_search(value)
elif isinstance(obj, list):
for item in obj:
recursive_search(item)
elif isinstance(obj, str) and obj.startswith("/home/"):
links.add("https://www.trulia.com" + obj)
recursive_search(data)
return links
The extract_links_from_json function scans complex JSON data, which may contain nested dictionaries and lists, and retrieves unique product links that start with the path "/home/". These are links to individual property pages, so the base URL https://www.trulia.com is prepended to every link found. A nested helper, recursive_search, walks every node of the JSON structure: if it sees a dictionary it recurses into every value, if it sees a list it recurses into every item, and if it sees a string beginning with "/home/" it prepends the base URL and adds the result to a set so duplicates are discarded. When the traversal finishes, the function returns the set of fully formatted, unique product links found in the JSON.
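The behaviour is easiest to see on a toy payload. The structure below is made up but mimics the shape of a real response, with "/home/..." strings buried at different depths:
sample_json = {
    "data": {
        "searchResultMap": {
            "homes": [
                {"url": "/home/123-example-st-san-francisco-ca-94110-0000000001"},
                {"details": {"url": "/home/456-sample-ave-san-francisco-ca-94112-0000000002"}},
                {"url": "/rent/not-a-home-link"},   # ignored: does not start with /home/
            ]
        }
    }
}

links = extract_links_from_json(sample_json)
for link in sorted(links):
    print(link)
# https://www.trulia.com/home/123-example-st-san-francisco-ca-94110-0000000001
# https://www.trulia.com/home/456-sample-ave-san-francisco-ca-94112-0000000002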
Saving Extracted Product Links to a Database
def save_links_to_database(db_name, table_name, links):
"""
Inserts unique product links into the SQLite database table.
This function connects to the SQLite database specified by the
`db_name` parameter, and inserts product links into the table
specified by `table_name`. If a link already exists in the table,
it will be ignored to ensure that only unique links are stored.
Args:
db_name (str): The name of the SQLite database to connect to.
table_name (str): The name of the table where the links will be inserted.
links (iterable): A collection of unique product links to be saved into the database.
Raises:
sqlite3.Error: If there is an error during the database operation (e.g.,
failure to insert a link into the database).
"""
conn = sqlite3.connect(db_name)
cursor = conn.cursor()
for link in links:
try:
cursor.execute(f"INSERT OR IGNORE INTO {table_name} "
"(product_link) VALUES (?)", (link,))
except sqlite3.Error as e:
print(f"Database error while saving link "
f"{link}: {e}")
conn.commit()
conn.close()
The save_links_to_database function stores the collected product links in the SQLite database named by db_name, in the table given by table_name. It takes a collection of product links and tries to insert each one. The SQL command INSERT OR IGNORE skips any link that already exists in the table, so no data is duplicated. Each insert is wrapped in a try block so that a database error on one link is printed without stopping the rest. After the loop, the changes are committed and the connection is closed, guaranteeing that every new product link has been safely written to the database.
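Because of INSERT OR IGNORE, inserting the same link twice leaves only one row. A small demonstration against a throwaway database (so the real one stays clean); the file name and link are made up:
import sqlite3

create_database("demo.db", TABLE_NAME)
demo_links = {"https://www.trulia.com/home/123-example-st-0000000001"}

save_links_to_database("demo.db", TABLE_NAME, demo_links)
save_links_to_database("demo.db", TABLE_NAME, demo_links)   # duplicate is ignored

conn = sqlite3.connect("demo.db")
count = conn.execute(f"SELECT COUNT(*) FROM {TABLE_NAME}").fetchone()[0]
conn.close()
print(count)  # 1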
Processing Transaction IDs to Extract Product Links
def process_transaction_ids(transaction_ids):
"""
Process a list of transaction IDs, send requests for each, and extract unique product links.
This function iterates over a list of transaction IDs, sends a request for each,
and processes the response to extract product links. These links are collected
into a set of unique links, which is returned at the end.
Args:
transaction_ids (list): A list of transaction IDs to be processed.
Returns:
set: A set containing unique product links extracted from the responses.
"""
unique_links = set()
for transaction_id in transaction_ids:
response = send_request(transaction_id)
if response and response.status_code == 200:
data = response.json()
links = extract_links_from_json(data)
unique_links.update(links)
print(f"Extracted links from transaction ID {transaction_id}")
else:
print(f"Request failed for transaction ID {transaction_id} "
f"with status code {response.status_code if response else 'N/A'}")
return unique_links
The process_transaction_ids function accepts a list of transaction IDs, sends a request for each one, and returns a set of unique product links gathered from the responses. For each transaction ID, it calls send_request. If the response is successful with a status code of 200, it parses the response as JSON and passes the result to extract_links_from_json, which pulls the product links out of the JSON; those links are added to a set called unique_links, and because sets only store unique elements, duplicates are automatically eliminated. If the request fails for a given transaction ID, the function prints an error message with the status code it received (or 'N/A' if there was no response). Finally, it returns the set of all unique product links collected across the transaction IDs, gathering links efficiently from multiple API responses while handling request errors gracefully.
Main Execution Flow for Extracting and Saving Product Links
# Main Execution
if __name__ == "__main__":
"""
Main execution flow for processing transaction IDs, extracting product links,
and saving them to an SQLite database.
This script performs the following steps:
1. Creates a database and table for storing product links.
2. Processes a list of transaction IDs to extract unique product links.
3. Saves the unique product links into the database.
The script prints a confirmation message when all links have been saved successfully.
"""
# Step 1: Create database and table
create_database(DB_NAME, TABLE_NAME)
# Step 2: Process transaction IDs and extract unique links
unique_links = process_transaction_ids(TRANSACTION_IDS)
# Step 3: Save unique links to the database
save_links_to_database(DB_NAME, TABLE_NAME, unique_links)
print(f"All unique product links have been saved to the database {DB_NAME}"
f"in table {TABLE_NAME}")
This is the main execution block of the script that fetches links from Trulia's API and stores them in a database. First, it creates the database and the table in which the links will be stored using create_database. Then it passes the list of predetermined transaction IDs to process_transaction_ids, which extracts the unique product links contained in each API response. The extracted links are saved to the database with save_links_to_database, which excludes duplicate entries. Finally, it prints a confirmation message stating that all links have been saved to the given database and table. This flow ties together database operations, API interaction, and data processing so the entire process is efficient and reusable.
STEP 2: Scraping Product Data from the Product Links
Importing Necessary Libraries
import sqlite3
import requests
from bs4 import BeautifulSoup
import random
import time
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
This block imports the necessary libraries: sqlite3 for database interaction, requests for making HTTP requests to fetch webpage content, and BeautifulSoup from bs4 to parse and extract data from the HTML of each page. The random and time libraries introduce random delays between requests, and urllib3 is used to disable SSL warnings for cases where HTTPS certificates might not be fully trusted, such as when requests are routed through a proxy.
Setting Up Proxy Configuration
PROXIES = {
"http": "http://datahutapi:password@proxy-server.datahutapi.com:8001",
"https": "http://datahutapi:password@proxy-server.datahutapi.com:8001"
}
As in Step 1, this defines the PROXIES dictionary with "http" and "https" keys, each pointing to the same authenticated proxy URL (proxy-server.datahutapi.com on port 8001). Every request in this script is routed through that proxy, hiding the client's IP address and helping bypass restrictions.
Setting Up the Database for Scraping
DATABASE = "Trulia_Webscraping.db"
The variable DATABASE is used to define the name of the SQLite database file, which in this case is "Trulia_Webscraping.db". This database serves as the central storage for managing data throughout the scraping process. It is used to store information such as the list of product links extracted from the Trulia API and any detailed data scraped from individual product pages. By using a database, the script ensures that data is organized, persistent, and easily retrievable for further analysis, even if the scraping process is interrupted or restarted. SQLite, being lightweight and easy to use, is a suitable choice for this task, allowing efficient handling of structured data without requiring additional setup.
Loading User Agents for Scraping
# Load user agents from file
def load_user_agents(file_path):
"""
Load a list of user agent strings from a file.
Args:
file_path (str): Path to the file containing user agent strings.
Each line in the file should represent one user agent.
Returns:
list: A list of user agent strings, with leading and trailing
whitespace removed from each line.
"""
with open(file_path, "r") as f:
return [line.strip() for line in f.readlines()]
USER_AGENTS = load_user_agents("data/user_agents.txt")
The load_user_agents function loads a list of user agent strings from a file. A user agent is the text string a browser or tool sends to a website to identify itself. By rotating user agents while scraping, the script mimics requests coming from different browsers and devices, which makes the traffic look more natural and less likely to be blocked. The function reads the file specified by file_path, where every line contains one user agent string, strips any extra spaces or newline characters from each line, and returns the result as a list. During scraping, a random entry from this list is attached to each request to simulate natural browsing habits and improve the reliability of data collection.
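For reference, data/user_agents.txt is just a plain text file with one complete User-Agent string per line; the entries below are illustrative examples, not the project's actual list:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15
Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0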
Creating Database Tables for Scraping
# Function to create tables if they don't exist
def create_tables():
"""
Create necessary tables in the SQLite database if they do
not already exist.
Tables:
1. failed_urls:
- id (INTEGER): Primary key, auto-incremented.
- url (TEXT): The URL that failed to process.
- reason (TEXT): Reason for the failure (optional).
- timestamp (TIMESTAMP): Timestamp of the failure
(defaults to the current time).
2. product_data:
- id (INTEGER): Primary key, auto-incremented.
- product_link (TEXT): URL of the product.
- home_name (TEXT): Name of the home or product.
- location (TEXT): Location details of the product.
- price (TEXT): Price of the product.
- mortgage (TEXT): Mortgage information (if available).
- specification (TEXT): Specifications of the product.
- description (TEXT): Product description.
- highlights (TEXT): Key highlights of the product.
- amenities (TEXT): Amenities included with the product.
- tax (TEXT): Tax-related information.
- timestamp (TIMESTAMP): Timestamp of when the entry was
created (defaults to the current time).
The function establishes a connection to the SQLite database specified
by the `DATABASE` constant,creates the tables if they do not exist,
commits the changes, and then closes the connection.
Returns:
None
"""
conn = sqlite3.connect(DATABASE)
cursor = conn.cursor()
# Create the failed_urls table
cursor.execute("""
CREATE TABLE IF NOT EXISTS failed_urls (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT NOT NULL,
reason TEXT,
timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
""")
# Create the product_data table
cursor.execute("""
CREATE TABLE IF NOT EXISTS product_data (
id INTEGER PRIMARY KEY AUTOINCREMENT,
product_link TEXT NOT NULL,
home_name TEXT,
location TEXT,
price TEXT,
mortgage TEXT,
specification TEXT,
description TEXT,
highlights TEXT,
amenities TEXT,
tax TEXT,
timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
""")
conn.commit()
conn.close()
The create_tables function makes sure the SQLite database has all the tables needed for the scraping process. It connects to the database referenced by the constant DATABASE and creates two tables: failed_urls and product_data. The failed_urls table keeps track of URLs that failed during scraping, storing the URL itself, an optional reason for the failure, and a timestamp; this makes it possible to retry or debug problem URLs later. The product_data table stores the information gathered from each property page: the home name, location, price, mortgage details, specifications, description, highlights, amenities, and tax information, along with a timestamp showing when the row was added. Because both statements use CREATE TABLE IF NOT EXISTS, existing tables and their data are left untouched. Finally, the function commits the changes and closes the connection, so everything is in place before scraping begins.
Fetching the Next URL to Scrape
# Fetch the next URL to scrape
def get_next_url():
"""
Fetch the next URL to scrape from the database.
The function retrieves a single URL from the `product_links`
table where the `status` is 0, indicating that the URL has
not yet been scraped. It returns the first available URL
or `None` if no such URL exists.
Returns:
str or None: The next product URL to scrape, or `None`
if no URLs are pending.
Notes:
- The function assumes the existence of a `product_links`
table with the following structure:
- product_link (TEXT): The URL of the product.
- status (INTEGER): A flag indicating the scrape status
(0 for pending, 1 for completed).
"""
# Connect to the database
conn = sqlite3.connect(
DATABASE
)
# Create a cursor object to execute SQL queries
cursor = conn.cursor()
# Execute the SQL SELECT query to fetch the next URL
# with a pending status (0)
cursor.execute(
"""
SELECT
product_link
FROM
product_links
WHERE
status = 0
LIMIT 1
"""
)
# Fetch the result of the query
result = cursor.fetchone()
# Close the database connection
conn.close()
# Return the URL if a result is found, else return None
return result[0] if result else None
The get_next_url function retrieves the next product URL to scrape. It connects to the SQLite database and queries the product_links table, which holds the list of URLs, looking for one whose status is 0, meaning it hasn't been scraped yet. LIMIT 1 restricts the result to the first available URL. If a pending URL exists, the function returns it; if there is no URL with pending status, it returns None. This keeps track of which URLs still need processing so the script works through new links in an orderly fashion. The connection is closed as soon as the query result has been fetched, avoiding dangling open connections, and the URL (if any) is returned for further processing.
Updating the Status of a URL
# Update the status of a URL
def update_url_status(url, status):
"""
Update the status of a URL in the database.
This function sets the `status` of a given URL in
the `product_links` table.The status indicates whether
the URL has been processed or not.
Args:
url (str): The URL whose status needs to be updated.
status (int): The new status to set. Common values
might include:
- 0: Pending
- 1: Completed
- Other values as defined by the application.
Returns:
None
Notes:
- The function assumes the existence of a `product_links`
table with the following structure:
- product_link (TEXT): The URL of the product.
- status (INTEGER): A flag indicating the scrape status.
"""
# Connect to the database
conn = sqlite3.connect(
DATABASE
)
# Create a cursor object to execute SQL queries
cursor = conn.cursor()
# Define the SQL UPDATE query to modify the status of the URL
cursor.execute(
"""
UPDATE
product_links
SET
status = ?
WHERE
product_link = ?
""",
(
status,
url
)
)
# Commit the changes to the database
conn.commit()
# Close the database connection
conn.close()
The update_url_status function updates the status of a URL in the database after it has been processed. It takes two inputs: the url to update and the new status to set. The status tracks whether the URL has been scraped, with 0 meaning pending and 1 meaning completed. The function connects to the SQLite database, creates a cursor, and runs an UPDATE query against the product_links table to set the new status for the given URL. It then commits the change, making the modification permanent, and closes the connection. This lets the system keep track of which URLs have already been processed and prevents rescraping the same URLs. The sketch below shows how it pairs with get_next_url.
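A minimal sketch of how these two helpers work together in a scraping loop; the full loop also fetches and parses each page using the functions defined in the rest of this step:
# Claim the next pending URL, process it, then mark it as completed
url = get_next_url()
if url:
    print("Scraping:", url)
    # ... fetch the page and extract its data here ...
    update_url_status(url, 1)   # 1 = completed
else:
    print("No pending URLs left to scrape")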
Saving Scraped Data to the Database
# Save scraped data to the database
def save_data(product_link, data):
"""
Save scraped product data to the database.
This function inserts the scraped product data into
the `product_data` table in the database.The table
should include fields such as product link, home name,
location, price, and other details.
Args:
product_link (str): The URL of the product being saved.
data (tuple): A tuple containing the product details in
the following order:
- home_name (str): Name of the product or home.
- location (str): Location details.
- price (str): Price of the product.
- mortgage (str): Mortgage information (if available).
- specification (str): Specifications of the product.
- description (str): Description of the product.
- highlights (str): Key highlights of the product.
- amenities (str): Amenities included with the product.
- tax (str): Tax-related information.
Returns:
None
Notes:
- The function assumes the existence of a `product_data`
table with the following structure:
- product_link (TEXT): The URL of the product.
- home_name (TEXT): Name of the home or product.
- location (TEXT): Location details.
- price (TEXT): Price of the product.
- mortgage (TEXT): Mortgage information.
- specification (TEXT): Specifications of the product.
- description (TEXT): Product description.
- highlights (TEXT): Key highlights of the product.
- amenities (TEXT): Amenities included with the product.
- tax (TEXT): Tax-related information.
"""
# Connect to the database
conn = sqlite3.connect(
DATABASE
)
# Create a cursor object to execute SQL queries
cursor = conn.cursor()
# Define the SQL INSERT query for the product_data table
cursor.execute(
"""
INSERT INTO product_data (
product_link,
home_name,
location,
price,
mortgage,
specification,
description,
highlights,
amenities,
tax
)
VALUES (
?, ?, ?, ?, ?, ?, ?, ?, ?, ?
)
""",
(
product_link,
*data
)
)
# Commit the changes to the database
conn.commit()
# Close the database connection
conn.close()
The save_data function writes the detailed product data scraped from a given URL to the database. It takes two parameters: product_link, the URL of the product being scraped, and data, a tuple containing the product's details, including its name, location, price, mortgage information, specifications, description, highlights, amenities, and tax details. The function connects to the SQLite database, creates a cursor, and executes an INSERT INTO query that places the scraped values into the product_data table, whose columns (product_link, home_name, location, price, and so on) mirror the tuple's contents. After the insert, the changes are committed so they are saved permanently, and the connection is closed. This keeps all scraped data organized and structured for later use or analysis.
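The tuple passed as data must follow the column order used in the INSERT statement. Here is a hedged example with entirely made-up values; in the real flow these come from the extractor functions, and values that are dictionaries or lists would need to be converted to strings before insertion:
scraped = (
    "123 Example St",                     # home_name
    "San Francisco, CA 94110",            # location
    "$1,250,000",                         # price
    "Est. $7,900/mo",                     # mortgage
    "3 Beds | 2 Baths | 1,450 sqft",      # specification
    "Charming home close to the park.",   # description
    "{'Parking': 'Garage'}",              # highlights (stringified dict)
    "[{'Interior': ['Dishwasher']}]",     # amenities (stringified list)
    "{'Year': '2023', 'Tax': '$8,500'}",  # tax (stringified dict)
)
save_data("https://www.trulia.com/home/123-example-st-0000000001", scraped)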
Saving Failed URLs to the Database
# Save failed URLs to the database
def save_failed_url(url, reason):
"""
Save a failed URL and the reason for failure to
the database.
This function logs URLs that could not be successfully
processed into the `failed_urls` table, along with the
reason for the failure.
Args:
url (str): The URL that failed to be processed.
reason (str): A brief description of the reason
for the failure.
Returns:
None
Notes:
- The function assumes the existence of a `failed_urls`
table with the following structure:
- url (TEXT): The URL that failed.
- reason (TEXT): Reason for the failure.
- timestamp (TIMESTAMP): Timestamp of when
the failure occurred (defaults to the current time).
"""
conn = sqlite3.connect(DATABASE)
cursor = conn.cursor()
cursor.execute("""
INSERT INTO failed_urls (url, reason)
VALUES (?, ?)
""", (url, reason))
conn.commit()
conn.close()
The save_failed_url function records every link that could not be scraped, along with the reason. It takes two parameters: url, the URL that failed, and reason, a short description of why it failed, such as a network issue or a 404 error. The function connects to the SQLite database, creates a cursor, and executes an INSERT INTO query that stores the failed URL and the reason in the failed_urls table, which has columns for the URL, the failure reason, and a timestamp indicating when the failure occurred. After the insert, it commits the change and closes the connection. This makes it easy to monitor and debug scraping errors and to re-process failed URLs later.
Making a Request with Random User-Agent and Delay
# Request wrapper with random User-Agent and delay
def make_request(url):
"""
Make an HTTP GET request to a specified URL with a
random User-Agent and a random delay.
This function sends a GET request using the `requests`
library, selecting a random User-Agent from a predefined
list to simulate realistic browsing behavior. A random
delay is added between requests to reduce the likelihood
of being flagged as a bot.
Args:
url (str): The URL to send the GET request to.
Returns:
Response: The HTTP response object returned by the
`requests.get` method.
Notes:
- Assumes the existence of:
- `USER_AGENTS`: A list of User-Agent strings.
- `PROXIES`: A dictionary of proxy settings
for the request.
- Uses `time.sleep` to introduce a random delay between
20 and 30 seconds.
- SSL verification is disabled (`verify=False`).
Be cautious when using this in production.
"""
headers = {"User-Agent": random.choice(USER_AGENTS)}
response = requests.get(
url, headers=headers,
proxies=PROXIES, verify=False
)
time.sleep(random.uniform(20, 30)) # Random delay
return response
The make_request function sends a GET request to a given URL while imitating a real user browsing the web. It first picks a random User-Agent from the USER_AGENTS list so that requests don't all appear to come from the same browser or device, which makes them less likely to be flagged as a bot. The request itself goes through requests.get with the randomly selected User-Agent and the proxy settings. SSL verification is disabled with verify=False, which is convenient when routing through a proxy but should be used cautiously in production environments. After the response is received, time.sleep introduces a random delay of between 20 and 30 seconds before returning, so the site isn't hit with too many requests in a short time. The returned HTTP response can then be parsed to pull data out of the page.
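Putting the pieces so far together, here is a short, hedged sketch of fetching one pending page and turning it into the BeautifulSoup object that all of the extractor functions below expect:
url = get_next_url()
if url:
    response = make_request(url)
    if response.status_code == 200:
        # The extractor functions below all take this soup object as input
        soup = BeautifulSoup(response.text, "html.parser")
        print(soup.title.get_text(strip=True) if soup.title else "No <title> found")
    else:
        save_failed_url(url, f"HTTP status {response.status_code}")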
Extracting the Home Name from HTML
def extract_home_name_from_html(soup):
"""
Extract the home name from the HTML content.
This function searches for a `<span>` element with the
attribute `data-testid='home-details-summary-headline'`
in the provided BeautifulSoup object and retrieves its
text content.
Args:
soup (BeautifulSoup): A BeautifulSoup object containing
the parsed HTML content.
Returns:
str or None: The extracted home name as a string if
the element is found; otherwise, `None`.
Notes:
- The function uses the `get_text` method with `strip=True`
to remove leading and trailing whitespace.
- Returns `None` if the specified `<span>` element is not
found in the HTML.
"""
# Search for the span containing the home name
home_name_span = soup.find(
'span',
{'data-testid': 'home-details-summary-headline'}
)
# Extract the text content if the home name span exists
home_name = (
home_name_span.get_text(
strip=True
)
if home_name_span
else None
)
# Return the extracted home name or None
return home_name
The extract_home_name_from_html function extracts the name of a home or property from a page's parsed HTML using BeautifulSoup. It looks for a <span> tag carrying the attribute data-testid='home-details-summary-headline', which is unique to the home name on the page. If the tag is found, the function fetches its text with get_text(strip=True), removing leading and trailing whitespace. If the tag doesn't appear on the page, the function returns None; otherwise it returns the extracted home name as a string, which the scraper uses later alongside the other property details.
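Because the extractor only needs a BeautifulSoup object, it is easy to try on a simplified, hypothetical snippet that follows Trulia's data-testid convention:
from bs4 import BeautifulSoup

sample_html = """
<span data-testid="home-details-summary-headline">123 Example St</span>
"""
soup = BeautifulSoup(sample_html, "html.parser")
print(extract_home_name_from_html(soup))   # 123 Example St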
Extracting Location from HTML
def extract_location_from_html(soup):
"""
Extract the location details from the HTML content.
This function searches for a `<span>` element with
the attribute `data-testid='home-details-summary-city-state'`
in the provided BeautifulSoup object and retrieves its
text content.
Args:
soup (BeautifulSoup): A BeautifulSoup object containing
the parsed HTML content.
Returns:
str or None: The extracted location as a string if the
element is found; otherwise, `None`.
Notes:
- The function uses the `get_text` method with
`strip=True` to remove leading and trailing whitespace.
- Returns `None` if the specified `<span>` element
is not found in the HTML.
"""
# Search for the span containing the location details
location_span = soup.find(
'span',
{'data-testid': 'home-details-summary-city-state'}
)
# Extract the text content if the location span exists
location = (
location_span.get_text(
strip=True
)
if location_span
else None
)
# Return the extracted location or None
return location
The extract_location_from_html function gets the property's location details from the HTML of a Trulia page. Using BeautifulSoup, it scans the HTML for a <span> tag with the attribute data-testid='home-details-summary-city-state' and extracts the text inside it with get_text(strip=True), which also removes surrounding whitespace. If the <span> tag cannot be found, the function returns None; otherwise it returns the location as a string that can be used in the data collection process.
Extracting Price from HTML
def extract_price_from_html(soup):
"""
Extract the price details from the HTML content.
This function searches for a `<div>` element with
specific classes that likely contain the price
information in the provided BeautifulSoup object
and retrieves its text content.
Args:
soup (BeautifulSoup): A BeautifulSoup object
containing the parsed HTML
content.
Returns:
str or None: The extracted price as a string if
the element is found; otherwise,
`None`.
Notes:
- The function uses the `get_text` method with `strip=True`
to remove leading and trailing whitespace.
- Returns `None` if the specified `<div>` element is not
found in the HTML.
- The class names used are based on a specific structure
and may need adjustment if the HTML structure changes.
"""
# Search for the div containing the price information
price_div = soup.find(
'div',
class_='Text__TextBase-sc-13iydfs-0-div '
'Text__TextContainerBase-sc-13iydfs-1 hObzVe icHjbr'
)
# Extract the text content if the price div exists
price = (
price_div.get_text(
strip=True
)
if price_div
else None
)
# Return the extracted price or None
return price
This function extracts a property's price from the HTML of the Trulia page. It searches for the <div> that holds the price, identified by a set of class names specific to that element. BeautifulSoup locates the <div>, and get_text(strip=True) pulls out its text with leading and trailing whitespace removed. If the <div> is found, the function returns the price as a string; otherwise it returns None. Note that these generated class names are tied to Trulia's current page structure and may need adjusting if the markup changes.
Extracting Estimated Mortgage from HTML
def extract_estimated_mortgage_from_html(soup):
"""
Extract the estimated mortgage details from the
HTML content.
This function searches for a `<div>` element with
the attribute `data-testid='summary-mortgage-estimate-details'`
in the provided BeautifulSoup object and retrieves its
text content.
Args:
soup (BeautifulSoup): A BeautifulSoup object containing
the parsed HTML content.
Returns:
str or None: The extracted estimated mortgage information
as a string if the element is found; otherwise, `None`.
Notes:
- The function uses the `get_text` method with `strip=True`
to remove leading and trailing whitespace.
- Returns `None` if the specified `<div>` element is not
found in the HTML.
"""
# Search for the div containing the estimated mortgage details
mortgage_div = soup.find(
'div',
{'data-testid': 'summary-mortgage-estimate-details'}
)
# Extract the text content if the mortgage div exists
estimated_mortgage = (
mortgage_div.get_text(
strip=True
)
if mortgage_div
else None
)
# Return the extracted estimated mortgage details or None
return estimated_mortgage
The extract_estimated_mortgage_from_html function scrapes the estimated mortgage information from the HTML of a Trulia property page. It searches for the <div> element holding the mortgage estimate, identified by the attribute data-testid='summary-mortgage-estimate-details'. If this element exists in the page's HTML, the function pulls out its text content with get_text(strip=True), stripping extra whitespace. If no such element exists, the function returns None.
Extracting Property Specifications from Trulia
def extract_specifications(soup):
"""
Extract the specifications from the HTML content.
This function searches for a `<div>` element with
the attribute `data-testid='facts-list'` in the
provided BeautifulSoup object and retrieves its
text content. The `separator=' | '` parameter
is used to join the text content with a pipe
character for better readability.
Args:
soup (BeautifulSoup): A BeautifulSoup object
containing the parsed HTML content.
Returns:
str or None: The extracted specifications as a
string if the element is found; otherwise, `None`.
Notes:
- The function uses the `get_text` method with `strip=True`
to remove leading and trailing whitespace.
- A custom separator `|` is used to join the text parts
together for better display.
- Returns `None` if the specified `<div>` element is not
found in the HTML.
"""
# Search for the parent div containing the specifications
parent_div = soup.find(
'div',
{'data-testid': 'facts-list'}
)
# Extract the text content with a custom separator
# if the parent div exists
specifications = (
parent_div.get_text(
strip=True,
separator=' | '
)
if parent_div
else None
)
# Return the extracted specifications or None
return specifications
The extract_specifications function takes the parsed page of a property as input and searches for a <div> element with the attribute data-testid='facts-list', which holds facts such as the number of bedrooms and bathrooms and the square footage. After finding that <div>, it uses get_text() to extract the text content, passing separator=' | ' so the individual pieces of text are joined with a pipe character (|), keeping the specifications in a clear and structured format. If the element is not found, the function returns None.
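On a simplified, hypothetical facts-list snippet, the pipe separator produces a compact one-line summary:
from bs4 import BeautifulSoup

sample_html = """
<div data-testid="facts-list">
  <div>3 Beds</div>
  <div>2 Baths</div>
  <div>1,450 sqft</div>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
print(extract_specifications(soup))   # 3 Beds | 2 Baths | 1,450 sqft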
Extracting Property Description from Trulia
def extract_description_from_html(soup):
"""
Extract the description details from the HTML content.
This function searches for a `<div>` element with the
attribute `data-testid='home-description-text-description-text'`
in the provided BeautifulSoup object and retrieves its text content.
Args:
soup (BeautifulSoup): A BeautifulSoup object containing
the parsed HTML content.
Returns:
str or None: The extracted description as a string if
the element is found; otherwise, `None`.
Notes:
- The function uses the `get_text` method with `strip=True`
to remove leading and trailing whitespace.
- Returns `None` if the specified `<div>` element is not
found in the HTML.
"""
# Search for the description div in the soup object
description_div = soup.find(
'div',
{'data-testid': 'home-description-text-description-text'}
)
# Extract the text content if the description div exists
description = (
description_div.get_text(strip=True)
if description_div
else None
)
# Return the extracted description or None
return description
This function extracts a property's description from a Trulia property page by searching the HTML for the <div> element that holds the description, identified by the attribute data-testid='home-description-text-description-text'. Once it finds that element, it returns the text inside it using get_text(strip=True), which also removes whitespace at the start and end of the text. If the element does not exist on the page, it returns None, meaning no description was found for the property. Like the other extractors, this function depends on the page structure staying the same.
Extracting Home Highlights from Trulia
def extract_home_highlights_from_html(soup):
"""
Extract the highlights details from the HTML content.
This function searches for all `<div>` elements with
the class `Grid__GridContainer-sc-144isrp-1 iXzkWe`
within the provided BeautifulSoup object. It then
iterates over these containers, extracting
the key-value pairs where the key is represented by
a `<div>` with the class `Text__TextBase-sc-13iydfs-0-div
Text__TextContainerBase-sc-13iydfs-1 cwsXtm icHjbr`
and the value by a `<div>` with the class
`Text__TextBase-sc-13iydfs-0-div
Text__TextContainerBase-sc-13iydfs-1 IETTU icHjbr`.
Args:
soup (BeautifulSoup): A BeautifulSoup object containing
the parsed HTML content.
Returns:
dict or None: A dictionary of highlights with keys and
their corresponding values if any highlights
are found; otherwise, `None`.
Notes:
- The function uses the `get_text` method with `strip=True`
to remove leading and trailing whitespace.
- Returns `None` if no relevant highlights are found in
the HTML.
"""
# Find all highlight containers in the soup object
highlights_container = soup.find_all(
'div',
class_='Grid__GridContainer-sc-144isrp-1 iXzkWe'
)
# Initialize an empty dictionary to store highlights
highlights = {}
# Iterate over each highlight container
for container in highlights_container:
# Find the key element in the current container
key_tag = container.find(
'div',
class_='Text__TextBase-sc-13iydfs-0-div '
'Text__TextContainerBase-sc-13iydfs-1 cwsXtm icHjbr'
)
# If the key element exists, extract the key text
if key_tag:
key = key_tag.get_text(strip=True)
# Find the value element(s) in the current container
value_tag = container.find_all(
'div',
class_='Text__TextBase-sc-13iydfs-0-div '
'Text__TextContainerBase-sc-13iydfs-1 IETTU icHjbr'
)
# Extract the value text if the value element exists
value = (
value_tag[0].get_text(strip=True)
if value_tag
else None
)
# Add the key-value pair to the highlights dictionary
highlights[key] = value
# Return the highlights dictionary if not empty, otherwise return None
return highlights if highlights else None
The extract_home_highlights_from_html function collects the highlights of a property listed on Trulia by scanning the page's HTML for the <div> elements that contain highlight information, identified by the class Grid__GridContainer-sc-144isrp-1 iXzkWe. Within each of these containers it looks for key-value pairs: the key lives in a <div> with one specific class, and the associated value lives in another <div> with a different class. The function extracts the text of both key and value, stripping surrounding whitespace, and stores the pairs in a dictionary, which it returns. If no highlights are found, it returns None. This way, notable property attributes such as amenities, condition, or special features end up neatly stored as a dictionary for further processing.
Extracting Structured Amenities Tables
def extract_all_amenities_tables(soup):
"""
Extract all structured amenities tables from the
HTML content.
This function finds all `div` elements with the attribute
`data-testid='structured-amenities-table-category'`
within the provided BeautifulSoup object. For each table,
it retrieves rows that contain subcategories and their
associated details. The function then builds a dictionary
where each key is a subcategory and each value is a list
of details for that subcategory.
Args:
soup (BeautifulSoup): A BeautifulSoup object containing
the parsed HTML content.
Returns:
list or None: A list of dictionaries, where each dictionary
represents an amenities table with subcategories
and their corresponding details as key-value
pairs; otherwise, `None`.
Notes:
- The function assumes the structure of the HTML
remains consistent.
- Each `tr` element with class `Table__TableRow-sc-latbb5-0`
represents a row in the table.
- Each `div` with class `Text__TextBase-sc-13iydfs-0-div
Text__TextContainerBase-sc-13iydfs-1 htNosZ icHjbr`
represents a subcategory.
- Each `span` with class `sc-9be18632-0 bEFlof` represents
the details for that subcategory.
"""
# Find all amenities tables in the soup object
amenities_tables = soup.find_all(
'div',
{'data-testid': 'structured-amenities-table-category'}
)
# Initialize an empty list to store all table data
all_tables_data = []
# Iterate over each amenities table found
for table in amenities_tables:
# Find all rows in the current table
rows = table.find_all(
'tr',
class_='Table__TableRow-sc-latbb5-0'
)
# Initialize a dictionary to store data for the current table
table_data = {}
# Iterate over each row in the table
for row in rows:
# Find the subcategory element in the row
subcategory_tag = row.find(
'div',
class_='Text__TextBase-sc-13iydfs-0-div '
'Text__TextContainerBase-sc-13iydfs-1 htNosZ icHjbr'
)
# Extract the subcategory text, if the tag exists
subcategory = (
subcategory_tag.get_text(strip=True)
if subcategory_tag
else None
)
# Find all details elements in the row
details_tags = row.find_all(
'span',
class_='sc-9be18632-0 bEFlof'
)
# Extract text from all details tags
details = [
detail.get_text(strip=True)
for detail in details_tags
]
# Add the subcategory and its details to the table data
table_data[subcategory] = details
# Append the current table data to the list of all tables
all_tables_data.append(table_data)
# Return the list of all tables if not empty, otherwise return None
return all_tables_data if all_tables_data else None
This part of the scraper extracts the structured amenities information attached to a Trulia property listing. It scans the HTML for the sections that contain amenity tables: div elements whose data-testid attribute equals 'structured-amenities-table-category'. Within each of these divs, the function iterates over the individual rows, represented by tr elements with a specific class. Each row holds a subcategory and its details: the subcategory is read from a div with the specified classes, and the corresponding details are read from span elements tagged with another class. These are stored in a dictionary, with the subcategory name as the key and the list of details as the value. This is repeated for every amenities table found on the page, producing a list of dictionaries in which each dictionary represents one table. If no tables are found, the function returns None. This structured approach lets the scraper capture all amenities information for each property.
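As with the highlights, the quick check below feeds the function a simplified, hypothetical amenities table that follows the same data-testid attribute and class names; the "Heating"/"Central" values are made up for illustration, and the function defined above is assumed to be in scope.
from bs4 import BeautifulSoup

# Hypothetical snippet mirroring one amenities table with a single row
sample_html = """
<div data-testid="structured-amenities-table-category">
  <table>
    <tr class="Table__TableRow-sc-latbb5-0">
      <td><div class="Text__TextBase-sc-13iydfs-0-div Text__TextContainerBase-sc-13iydfs-1 htNosZ icHjbr">Heating</div></td>
      <td><span class="sc-9be18632-0 bEFlof">Central</span></td>
    </tr>
  </table>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
print(extract_all_amenities_tables(soup))
# Expected output if the markup matches: [{'Heating': ['Central']}]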
Extracting Tax Details
def extract_tax_details(soup):
"""
Extract tax details from the HTML content.
This function searches for specific rows in the HTML
structure that contain information about the
year, tax, and assessment. It retrieves the text
content of the corresponding cells and stores them
in a dictionary.
Args:
soup (BeautifulSoup): A BeautifulSoup object
containing the parsed
HTML content.
Returns:
dict or None: A dictionary containing the tax details
with keys "Year", "Tax", and "Assessment"
and their corresponding values as strings;
otherwise, `None` if the details are not found.
Notes:
- The function looks for specific headers
("Year", "Tax", "Assessment") in the HTML table
to find the relevant data.
- Uses `find_next('td')` to get the text from the
cell immediately following the header.
- Returns `None` if no relevant tax details are found
in the HTML.
"""
tax_details = {}
year_row = soup.find('th', string="Year")
if year_row:
tax_details["Year"] = year_row.find_next('td').get_text(strip=True)
tax_row = soup.find('th', string="Tax")
if tax_row:
tax_details["Tax"] = tax_row.find_next('td').get_text(strip=True)
assessment_row = soup.find('th', string="Assessment")
if assessment_row:
tax_details["Assessment"] = (
assessment_row.find_next('td').get_text(strip=True)
)
return tax_details if tax_details else None
This part of the scraper extracts the tax-related information from a Trulia property listing. The function inspects the HTML for the table rows that carry the key details: year, tax amount, and assessment. It first searches for the header cell containing "Year" and, when found, reads the text of the cell that follows it using find_next('td'). It applies the same approach to the "Tax" and "Assessment" headers, reading the values from the adjacent cells. The results are stored in a dictionary with the keys "Year", "Tax", and "Assessment". If none of the tax details are found, the function returns None. This lets the scraper collect and structure tax information for each property listing.
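The snippet below exercises the function on a minimal, made-up tax history table; the year, tax, and assessment figures are placeholders rather than real Trulia data, and the function defined above is assumed to be in scope.
from bs4 import BeautifulSoup

# Hypothetical tax history table with placeholder values
sample_html = """
<table>
  <tr><th>Year</th><td>2023</td></tr>
  <tr><th>Tax</th><td>$9,852</td></tr>
  <tr><th>Assessment</th><td>$812,000</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
print(extract_tax_details(soup))
# Expected output: {'Year': '2023', 'Tax': '$9,852', 'Assessment': '$812,000'}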
Main Scraping Process
# Main scraping process
def scrape():
"""
Main function to manage the scraping process.
This function first ensures that the necessary tables
in the database are created using `create_tables()`.
It then enters a loop to repeatedly get URLs to scrape,
make requests to these URLs, and extract data from
the HTML content. For each URL:
- Makes a GET request to fetch the HTML content.
- Uses BeautifulSoup to parse the HTML.
- Extracts relevant data such as home name, location,
price, mortgage, specifications, description,
highlights, amenities, and tax details using the
appropriate extraction functions.
- Saves the extracted data to the database using `save_data()`.
- Updates the URL status to indicate successful scraping using
`update_url_status()`.
- If an error occurs during the process, it saves the URL
and the error message to `failed_urls` table and logs the
failure.
Args:
None
Returns:
None
Notes:
- The function operates in a loop until there are no
more URLs to scrape.
- Each step is wrapped in a try-except block to handle
errors gracefully.
- Extracted data is saved as a tuple to the `product_data`
table in the database.
- Failed URLs and reasons are recorded in the `failed_urls`
table for further analysis or retry.
"""
# Ensure tables are created before scraping
create_tables()
while True:
url = get_next_url()
if not url:
print("No more URLs to scrape.")
break
try:
response = make_request(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
# Extract data
home_name = extract_home_name_from_html(soup)
location = extract_location_from_html(soup)
price = extract_price_from_html(soup)
mortgage = extract_estimated_mortgage_from_html(soup)
specification = extract_specifications(soup)
description = extract_description_from_html(soup)
highlights = str(extract_home_highlights_from_html(soup)) # Save as string
amenities = str(extract_all_amenities_tables(soup)) # Save as string
tax = str(extract_tax_details(soup)) # Save as string
# Prepare data to be saved
data = (
home_name, location, price, mortgage,
specification, description, highlights,
amenities, tax
)
# Save data
save_data(url, data) # Pass URL along with the data
update_url_status(url, 1) # Mark as scraped
except Exception as e:
save_failed_url(url, str(e))
print(f"Failed to scrape {url}: {e}")
The main scraping function coordinates the entire process of gathering property data from the Trulia website. It starts by ensuring that the necessary database tables are created to store the data. Then, it enters an infinite loop, where it repeatedly fetches URLs that need to be scraped. For each URL, it sends an HTTP GET request to fetch the web page’s HTML content. Once the HTML is retrieved, it uses BeautifulSoup to parse and extract specific property details, such as the home name, location, price, mortgage estimate, specifications, description, highlights, amenities, and tax information. Each of these details is extracted using the relevant functions that target specific HTML elements. After extracting the data, it is packaged into a tuple and saved into the database. The status of the URL is updated to indicate that the scraping was successful. If any errors occur during the scraping of a URL (for example, if a request fails or the data cannot be extracted), the URL and the error message are logged in a "failed_urls" table for further review or retry. The process continues until there are no more URLs to scrape. Each step is wrapped in a try-except block to handle errors and ensure the process runs smoothly even if some URLs fail.
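After a run, it can be useful to check how far the scraper got. The short sketch below queries the two tables mentioned in the docstring (product_data and failed_urls); the database file name trulia_data.db is an assumption here, so replace it with whatever name your configuration uses.
import sqlite3

# Assumed database file name; adjust to match your own configuration
conn = sqlite3.connect("trulia_data.db")
cursor = conn.cursor()

# Count successfully scraped listings and failed attempts
scraped_count = cursor.execute("SELECT COUNT(*) FROM product_data").fetchone()[0]
failed_count = cursor.execute("SELECT COUNT(*) FROM failed_urls").fetchone()[0]

print(f"Listings scraped : {scraped_count}")
print(f"Failed URLs      : {failed_count}")

conn.close()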
Starting the Scraper
# Start the scraper
if __name__ == "__main__":
"""
Entry point for the scraper script.
This script initializes the scraping process by
calling the `scrape()` function. It is the starting
point when running the script directly. The `scrape()`
function handles the entire scraping workflow, including
checking URLs, making requests, extracting data, saving
to the database, and handling errors.
Args:
None
Returns:
None
Notes:
- Ensure that all necessary configurations
(e.g., database connection, user agents, proxies)
are correctly set up before running this script.
- The script runs indefinitely until all URLs to
be scraped are processed.
"""
scrape()
The entry point of the script is the if __name__ == "__main__": block, which ensures that the script runs only when executed directly rather than when imported as a module. When the script is executed, it calls the scrape() function to start the web scraping process. This function controls the entire workflow: checking for URLs to scrape, making requests to fetch the web pages, extracting the relevant data from those pages, saving the data to the database, and handling any errors that occur. The script processes URLs one after another until every URL has been scraped. Before running it, make sure all configurations, such as the database connection, user agents, and proxies, are correctly set up so the scraper can run smoothly from start to finish.
Libraries and Versions
This code utilizes several key libraries to perform web scraping and data processing. The versions used in this project are as follows: BeautifulSoup4 (v4.12.3) for parsing HTML content, Requests (v2.32.3) for making HTTP requests, and Playwright (v1.47.0) for browser automation. Pinning these versions helps ensure smooth integration and consistent behavior throughout the scraping workflow.
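One simple way to reproduce this environment is to pin these versions in a requirements file (the name requirements.txt is just a convention) and install it with pip install -r requirements.txt:
beautifulsoup4==4.12.3
requests==2.32.3
playwright==1.47.0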
Conclusion
With this comprehensive walkthrough, collecting real estate data from Trulia becomes a robust and systematic process. The two-step approach, gathering all the product URLs first and then extracting detailed information for each property, makes large-scale data collection manageable and reliable.
Key strengths of this implementation include:
A resilient scraping mechanism with built-in error handling and retry capabilities
Database-backed storage ensuring data persistence and scraping progress tracking
Anti-detection measures including proxy support, randomized user agents, and request delays
Comprehensive data extraction covering property details from prices to amenities
Modular design with separate functions for different aspects of data collection
This solution balances efficiency with responsible scraping practices: it introduces delays between requests and rotates user agents. The SQLite integration lets the scraper pick up where it left off if the process is interrupted, and failed URLs are logged so that problematic cases can easily be retried.
Developers and researchers can use this solution to gather real estate market data and adapt it to their own requirements; however, it should always be used responsibly, within the site's rules on scraping and without putting undue load on its servers.
AUTHOR
I’m Ambily, Data Analyst at Datahut. I specialize in developing automated data workflows that transform scattered web data into structured, decision-ready insights—particularly in real estate, e-commerce, and pricing intelligence.
At Datahut, we’ve spent over a decade helping businesses harness the power of web scraping to uncover market trends, track competitor listings, and streamline research. In this blog, I’ll walk you through how to automate real estate data extraction from Trulia using Python, saving hours of manual effort while giving you real-time property insights.
If you’re looking to build a scalable data pipeline for your real estate research, reach out via the chat widget on the right. We’d love to support your journey.