
How to Scrape Data from Booking.com?

  • Writer: Shahana Farvin
  • 25 min read

Have you ever wondered how websites like Booking.com show so many hotel details so quickly? What if you wanted to collect that information yourself—automatically? In this post, I’ll walk you through a real project where we scraped hotel details from Booking.com, focusing on stays in San Francisco.


Web scraping is a method used to collect data from websites. Think of it like copying information by hand from a web page, only much faster and smarter, because a computer does it for you. This can be incredibly useful when you want to gather data that isn’t easily available in a downloadable format.


For our project, we chose Booking.com because it's one of the largest travel sites, with listings from all around the world. It offers tons of useful information: hotel names, prices, locations, user reviews, and more. Our goal is to collect this data in a clean, organized format that we can later analyze or use for research.


Here’s how we approached it. First, we wrote a script to go through Booking.com’s search results for San Francisco and collect the links to individual hotel pages. Since the website doesn’t show all results on one page, we also handled the “Next” buttons—this process is called pagination. Once we had the list of hotel URLs, we moved on to the second part: visiting each hotel’s page and gathering details like name, address, price, room types, amenities, and user ratings.


To make all this work smoothly, we used a few handy tools. Playwright helped us load the website like a real user would, which is especially useful when pages rely on JavaScript to display content. Then we used Beautiful Soup to read the underlying HTML and pull out just the parts we needed. Finally, we saved all our collected data in a SQLite database, which is a simple and lightweight way to store information.


By the end of this project, we’ll have turned a bunch of regular web pages into a neat, structured dataset—ready for analysis. In the next sections, I’ll guide you step-by-step through the code and logic behind each part, so you can build your own scraper even if you're just starting out.


URL Collection Phase


Now that we’ve set the stage, let’s talk about what this scraper is actually built to do.


This project isn’t just about scraping hotel data for one specific day. Instead, it’s designed to work across multiple date ranges. That means it can automatically go through Booking.com’s search results for different check-in and check-out dates and collect hotel page links for each of those time periods.


So, imagine you want to see which hotels are available in San Francisco from December 23rd to December 24th, then again from the 24th to the 25th, and so on for a whole week. Doing that manually would take a lot of clicking, copying, and pasting. But with this solution, the whole process is automated. You just feed in the dates and destination, and the script takes care of everything—opening the search pages, scrolling through the results, and collecting the links to individual hotel pages.


This step is important because each hotel’s availability and pricing might change depending on the dates. By capturing links across different days, we set ourselves up to gather more accurate and complete data in the next phase.


In the next section, we’ll look at how we actually write the code to do this—step by step.


Import and Logging Setup

from playwright.sync_api import sync_playwright
import sqlite3
import logging
from typing import List, Tuple

# Setup logging for better visibility
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

To get started with building our scraper, we first bring in some important tools—what we call libraries in Python. These are like ready-made toolkits that help us do specific tasks without having to write everything from scratch.


The first one we use is playwright.sync_api. This is the key to our automation—it lets our script behave like a real user visiting Booking.com in a browser. It can click buttons, scroll through pages, and wait for content to load, just like you would.


Next, we have sqlite3, which helps us save the data we collect. Think of it as a tiny, portable database that lives inside a file on your computer. It's perfect for projects like this where we need to store lots of hotel links in an organized way.


We also import logging. This might sound a bit boring, but it's actually very helpful. Logging lets us keep a record of what’s happening while the script runs—like when it starts, what it’s doing, and if anything goes wrong. Each message includes a timestamp so we know exactly when things occurred, which makes troubleshooting much easier later.


Finally, we use typing. This isn’t something the script needs to run, but it’s useful for making our code cleaner and easier to understand. It lets us label the kind of data we expect in different parts of the script, which helps avoid mistakes.


Once we’ve imported everything, we set up the logging system. This makes sure that all messages—like successes or errors—are printed with the date and time, so we always have a clear record of what happened while our scraper was running.
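
To give you a feel for it, here’s what that logging setup produces when the script reports progress (the timestamp below is just an example):

import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logging.info("Scraping started")
# Output looks something like:
# 2024-12-23 10:15:32,481 - INFO - Scraping started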


`scrape_booking_links` Function: The Core Scraping Mechanism

def scrape_booking_links(url: str, checkin: str, checkout: str) -> List[Tuple[str, str, str]]:
   """
   Scrape hotel links from Booking.com search results for a specific date range.
   This function uses Playwright to:
   1. Navigate to the Booking.com search results page
   2. Scroll and load more results
   3. Extract unique hotel page links
   Args:
       url (str): Full Booking.com search results URL
       checkin (str): Check-in date in format 'DD/MM/YY'
       checkout (str): Check-out date in format 'DD/MM/YY'
   Returns:
       List[Tuple[str, str, str]]: A list of tuples containing:
       - Hotel page URL
       - Check-in date
       - Check-out date
   Raises:
       Exception: For navigation, scraping, or browser-related errors
   Notes:
       - Uses Chromium in non-headless mode for better debugging
       - Implements scroll and "Load more" strategies to capture more results
       - Adds user agent to mimic real browser behavior
   """
   links = []
   try:
       with sync_playwright() as p:
           # Launch browser with more stable settings
           browser = p.chromium.launch(
               headless=False,
               # Add more browser launch options for stability
               args=[
                   '--no-sandbox',
                   '--disable-setuid-sandbox',
                   '--disable-dev-shm-usage'
               ]
           )
           context = browser.new_context(
               # Add user agent to mimic a real browser
               user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
           )
           page = context.new_page()

           try:
               # Navigate to the page with error handling
               try:
                   page.goto(url, timeout=30000, wait_until='networkidle')
               except Exception as nav_error:
                   logging.error(f"Navigation error: {nav_error}")
                   return links

               # Max attempts to load more results
               max_scroll_attempts = 5
               scroll_attempts = 0

               while scroll_attempts < max_scroll_attempts:
                   # Scroll to bottom and top to trigger lazy loading
                   page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                   page.wait_for_timeout(2000)
                   page.evaluate("window.scrollTo(0, 0)")
                   page.wait_for_timeout(1000)
                   page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                   page.wait_for_timeout(2000)

                   # Try to find and click "Load more results" button
                   try:
                       load_more_button = page.locator("text='Load more results'").first
                       if load_more_button.is_visible():
                           load_more_button.click()
                           page.wait_for_timeout(3000)
                           scroll_attempts = 0  # Reset attempts if button was clicked
                       else:
                           scroll_attempts += 1
                   except Exception as load_more_error:
                       logging.info(f"No more 'Load more' button or error: {load_more_error}")
                       scroll_attempts += 1

               # Extract hotel links
               hotel_elements = page.query_selector_all(
                   "h3.aab71f8e4e > a.a78ca197d0"
               )
              
               # Capture unique links
               unique_links = set()
               for element in hotel_elements:
                   link = element.get_attribute('href')
                   if link and link not in unique_links:
                       # Ensure full URL
                       if not link.startswith('http'):
                           link = f"https://www.booking.com{link}"
                       unique_links.add(link)
                       links.append((link, checkin, checkout))

               logging.info(f"Scraped {len(links)} unique hotel links")

           except Exception as e:
               logging.error(f"Scraping error: {e}", exc_info=True)
          
           finally:
               # Ensure browser resources are closed
               try:
                   page.close()
                   context.close()
                   browser.close()
               except Exception as close_error:
                   logging.error(f"Error closing browser resources: {close_error}")

   except Exception as setup_error:
       logging.error(f"Playwright setup error: {setup_error}", exc_info=True)
  
   return links

Now we come to the heart of the script—a function called scrape_booking_links. This part does the heavy lifting. It’s in charge of visiting a Booking.com search results page and collecting all the individual hotel links you see listed there.


This function needs three things to get started:

  1. The URL of the search results page

  2. The check-in date

  3. The check-out date


With these inputs, it opens the Booking.com page using a tool called Playwright, which lets our script control a browser just like a human would. We specifically use the Chromium browser (the open-source project that Google Chrome is built on) for this task.


To make the browser behave more like a real user, we add a custom user agent—this is like a little ID card that tells websites what kind of device or browser is visiting. By doing this, we reduce the chance of our scraper being blocked or flagged as a bot.
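
If you wanted to go a step further, you could rotate between a few user agents rather than reusing a single one. A minimal sketch, assuming the browser object from the function above (the agent strings here are just illustrative examples):

import random

# A small pool of desktop user agents (illustrative examples)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
]

# Pick a different identity for each browser context
context = browser.new_context(user_agent=random.choice(USER_AGENTS))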


Once the page opens, we use a smart trick called scrolling. Booking.com, like many modern websites, doesn’t load everything at once. It loads more results only as you scroll down—this is known as lazy loading. So, the function scrolls down to the bottom of the page, waits a bit, then scrolls back up and repeats. This pattern encourages the site to load more and more hotel listings each time.


Sometimes, there’s also a “Load more results” button on the page. The function tries to click that button too—several times if needed—but not forever. We set a limit on how many times it can try, just in case something goes wrong or the site behaves differently.


After all the visible hotels are loaded, we extract the links using something called CSS selectors. These are like address labels for elements on the page—in this case, for hotel cards that contain the URLs we want. As the links are collected, we make sure there are no duplicates by using a set (a collection that only keeps unique items).


Finally, each hotel link is saved along with the date range it belongs to. This way, we know not just which hotel was listed, but when it was available.


`save_links_to_db` Function: Persistent Storage

def save_links_to_db(links: List[Tuple[str, str, str]], db_name: str = "booking_links.db"):
   """
   Save scraped hotel links to a SQLite database with duplicate prevention.
   This function:
   1. Creates a SQLite database if not exists
   2. Creates a 'links' table to store unique hotel URLs
   3. Inserts new links with their associated dates
   4. Prevents duplicate entries
   Args:
       links (List[Tuple[str, str, str]]): List of tuples containing:
           - Hotel page URL
           - Check-in date
           - Check-out date
       db_name (str, optional): Name of the SQLite database file.
           Defaults to "booking_links.db".
   Returns:
       None
   Raises:
       sqlite3.Error: For database connection or insertion errors
   Notes:
       - Uses INSERT OR IGNORE to prevent duplicate links
       - Logs the number of newly inserted unique links
   """
   conn = None
   try:
       conn = sqlite3.connect(db_name)
       cursor = conn.cursor()

       # Create table with a unique constraint so duplicates can be detected
       cursor.execute('''
           CREATE TABLE IF NOT EXISTS links (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               url TEXT,
               checkin_date TEXT,
               checkout_date TEXT,
               UNIQUE(url, checkin_date, checkout_date)
           )
       ''')

       # Use INSERT OR IGNORE to skip links that are already stored
       for link, checkin, checkout in links:
           try:
               cursor.execute(
                   "INSERT OR IGNORE INTO links (url, checkin_date, checkout_date) VALUES (?, ?, ?)",
                   (link, checkin, checkout)
               )
           except sqlite3.Error as insert_error:
               logging.error(f"Error inserting link: {insert_error}")
      
       # Commit changes and get number of inserted rows
       conn.commit()
       inserted_count = conn.total_changes
       logging.info(f"Inserted {inserted_count} new unique links")

   except sqlite3.Error as db_error:
       logging.error(f"SQLite Error: {db_error}", exc_info=True)
   finally:
       if conn:
           conn.close()

Once we've collected the hotel links from Booking.com, the next step is to save them somewhere safe. That’s where the save_links_to_db function comes in.


This function takes all the links we just scraped and stores them in a small, local database using SQLite. If you’re not familiar with SQLite, think of it like a digital notebook—it lets you store and organize data in tables, just like a spreadsheet, but it works right inside your Python project.


Here’s how it works: if the database file doesn’t already exist, the function will create one. Inside it, it sets up a table called links. This table includes a few key pieces of information for each hotel link:

  • a unique ID number,

  • the actual URL of the hotel page,

  • the check-in date,

  • and the check-out date.


When it comes time to insert the links, the function uses a smart method called “INSERT OR IGNORE.” This means that if a link has already been saved before, the function won’t insert it again. This helps us avoid having the same hotel link appear multiple times—even if we run the scraper again for the same dates.
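
One thing worth knowing: INSERT OR IGNORE only skips rows that violate a constraint, which is why the table declares UNIQUE(url, checkin_date, checkout_date). Here’s a quick, self-contained demonstration using an in-memory database and a made-up hotel URL:

import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE links (
        url TEXT, checkin_date TEXT, checkout_date TEXT,
        UNIQUE(url, checkin_date, checkout_date)
    )
""")
row = ("https://www.booking.com/hotel/us/example.html", "23/12/24", "24/12/24")
cursor.execute("INSERT OR IGNORE INTO links VALUES (?, ?, ?)", row)
cursor.execute("INSERT OR IGNORE INTO links VALUES (?, ?, ?)", row)  # silently skipped
cursor.execute("SELECT COUNT(*) FROM links")
print(cursor.fetchone()[0])  # prints 1 -- the duplicate was ignored
conn.close()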


After trying to insert all the links, the function counts how many new, unique ones were added to the database. It then logs that number so you can easily keep track of what’s being stored.


Just in case something goes wrong—like a database error—it also includes a safety check. If there’s a problem during the saving process, it will record an error message in the logs instead of crashing the entire script. This makes the function reliable and keeps the data clean and organized.


`main()` Function: Orchestrating the Scraping Process

def main():
   """
   Main execution function for Booking.com link scraping process.
   This function:
   1. Defines a list of Booking.com search result URLs for different dates
   2. Iterates through URLs to scrape hotel links
   3. Collects all unique links across different date ranges
   4. Saves collected links to a SQLite database
   Args:
       None
   Returns:
       None
    Notes:
       - Handles exceptions during scraping for individual URLs
       - Logs total number of links scraped
       - Warns if no links were scraped
   """
   urls_with_dates = [
       ("https://www.booking.com/searchresults.en-gb.html?ss=San+Francisco&ssne=San+Francisco&ssne_untouched=San+Francisco&label=gen173nr-1BCAEoggI46AdIM1gEaGyIAQGYAQm4ARnIAQzYAQHoAQGIAgGoAgO4AqO5yroGwAIB0gIkMzE4NDYxMTYtNjRlMC00NzQ0LWFhNGYtYzU2YmI4Y2FkMjUw2AIF4AIB&sid=74da2c31c035c6df8deb313a85e24f8e&aid=304142&lang=en-gb&sb=1&src_elem=sb&src=index&dest_id=20015732&dest_type=city&checkin=2024-12-23&checkout=2024-12-24&group_adults=2&no_rooms=1&group_children=0", "23/12/24", "24/12/24"),
       # add more links here..
   ]

   # Total links across all scraping attempts
   all_links = []
   for url, checkin, checkout in urls_with_dates:
       logging.info(f"Scraping: {url}")
       try:
           links = scrape_booking_links(url, checkin, checkout)
           all_links.extend(links)
       except Exception as e:
           logging.error(f"Error scraping {url}: {e}")

   if all_links:
       logging.info(f"Total links scraped: {len(all_links)}")
       save_links_to_db(all_links)
   else:
       logging.warning("No links were scraped.")

At the center of everything is the main() function. You can think of it as the conductor of the script—making sure each part plays its role at the right time to complete the full scraping process smoothly.


This function starts by creating a list of search result URLs from Booking.com, each with different check-in and check-out dates. By covering several date ranges, we make sure our scraper doesn’t miss any hotels that might only appear on specific days. This is especially useful if hotel availability changes often.
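
If you’d rather not paste a long search URL for every night by hand, the list can be generated. Here’s a minimal sketch using Python’s datetime module—note that the base URL is stripped down to the essential query parameters, so you may need to adapt it to match your own search:

from datetime import date, timedelta

def build_urls_with_dates(start: date, nights: int):
    """Generate (url, checkin, checkout) tuples for consecutive one-night stays."""
    # Simplified search URL -- adjust the parameters for your own destination
    base = ("https://www.booking.com/searchresults.en-gb.html"
            "?ss=San+Francisco&dest_type=city&group_adults=2&no_rooms=1&group_children=0")
    result = []
    for i in range(nights):
        checkin = start + timedelta(days=i)
        checkout = checkin + timedelta(days=1)
        url = f"{base}&checkin={checkin:%Y-%m-%d}&checkout={checkout:%Y-%m-%d}"
        result.append((url, f"{checkin:%d/%m/%y}", f"{checkout:%d/%m/%y}"))
    return result

# A week of one-night stays starting 23 December 2024
urls_with_dates = build_urls_with_dates(date(2024, 12, 23), nights=7)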


Once the list is ready, the script goes through each URL one by one. For every URL, it calls the scrape_booking_links function—the one we discussed earlier—to collect hotel page links. All the links found during each run are added to a single, growing list that holds everything we've gathered so far.


Of course, sometimes things don’t go perfectly. A page might not load correctly, or Booking.com might behave unexpectedly. That’s okay—this function is built to handle such moments gracefully. If there’s an error while working on one of the URLs, it doesn’t stop everything. Instead, it logs the error (so you’ll know what happened), and then it moves on to the next link in the list. This way, one problem doesn’t ruin the entire run.


After the script finishes checking all the URLs, it looks at what it collected. If we ended up with any hotel links, it sends them over to the save_links_to_db function to be safely stored in the database. But if no links were found—maybe because of a technical issue or no availability—it logs a message to let you know.


Execution Flow

# Run the main function
if __name__ == "__main__":
   main()

If you’re new to Python, this might look a bit strange—but here’s what it does in simple terms: it tells Python, “Only run the main() function if this file is being run directly.” That means, when you open a terminal and run the script, this condition becomes true, and main() is called.


This single line is what kicks off the entire scraping process. It launches the browser, navigates through all the Booking.com pages you’ve specified, scrolls and clicks to load hotel listings, collects the hotel links, and finally saves them neatly into your database.


Without this part, the script would just sit there, defining functions but never actually doing anything. So, think of it as the green light—the command that sets everything into motion.


Data Collection Phase


Now that we’ve gathered hotel links and saved them in a database, the next phase is all about digging deeper—visiting each individual hotel page and collecting detailed information.


This is where our second scraper comes in. It’s a bit more advanced, designed to go through each hotel link stored in the SQLite database, one by one. Using Playwright, the scraper opens each hotel page, waits for the content to fully load (including anything built with JavaScript), and then carefully pulls out the key details: the hotel’s name, location, description, room types, prices, facilities, and guest ratings.


Each of these elements gives us a clearer picture of what the hotel offers and what kind of experience guests might expect. But websites like Booking.com don’t always present data in the same way on every page. Some hotel pages might be missing certain sections—like pricing or a detailed description.


That’s why the scraper is built to handle these situations gracefully. If any piece of information isn’t found on a particular page, it simply marks it as “Not Available” and moves on. This makes the tool reliable and robust, even when the website content isn’t perfectly consistent.


Another smart feature is the tracking mechanism. Once the scraper finishes collecting data from a link, it marks that link as “scraped” in the database. This means if the script stops or crashes partway through, it can resume later from where it left off—without starting over or re-scraping the same pages. It’s an efficient way to manage large-scale data extraction.


So, in short, this scraper doesn't just grab links—it explores each one thoroughly, handles missing content with care, and keeps track of its progress. It’s a powerful way to turn Booking.com’s hotel pages into clean, structured data you can actually work with.


Import Section

import sqlite3
import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup

As usual, the script starts by importing the tools it needs: the built-in sqlite3 and asyncio modules, plus the third-party Playwright (this time its async API) and BeautifulSoup libraries.


`connect_to_database(db_name)` Function

def connect_to_database(db_name):
   """
   Establish a connection to the SQLite database.

   This function creates a connection to the specified SQLite database.
   If the database doesn't exist, it will be created automatically.

   Args:
       db_name (str): The name or path of the SQLite database file.

   Returns:
       sqlite3.Connection: A connection object to the specified database.
   """
   return sqlite3.connect(db_name)

Every good scraping project needs a place to store the data, and for that, we use a database. But before we can start saving anything, we need to set up a connection between our script and the database file. That’s exactly what the connect_to_database function does.


You can think of this function as the first handshake—it opens the line of communication between your scraper and the database where all the hotel information will be stored.


The best part? It’s completely automatic. You don’t need to create the database manually or set anything up in advance. The function checks whether the database file already exists:

  • If it does, it simply connects to it.

  • If it doesn’t, SQLite quietly creates a brand-new file on the spot and gets it ready to use.


This smart design keeps things simple and avoids unnecessary setup steps. As soon as the connection is made, the rest of the scraper can start inserting hotel links and detailed data without worrying about whether the storage space is ready.


So in short, connect_to_database is the gateway to saving and managing your scraped data. It’s a small piece of code that plays a big role in keeping your workflow organized and reliable.


`check_and_add_scraped_column(conn)` Function

def check_and_add_scraped_column(conn):
   """
   Verify and add a 'scraped' column to the 'links' table if it doesn't exist.

   This function checks the structure of the 'links' table and adds a 'scraped'
   column with a default value of 0 if it's not already present. This column
   is used to track which links have been processed during scraping.

   Args:
       conn (sqlite3.Connection): An active database connection.
   """
   cursor = conn.cursor()
   cursor.execute("PRAGMA table_info(links);")
   columns = [col[1] for col in cursor.fetchall()]
   if 'scraped' not in columns:
       cursor.execute("ALTER TABLE links ADD COLUMN scraped INTEGER DEFAULT 0;")
       conn.commit()

Another important part of this scraper is keeping track of which hotel links have already been processed—and which ones are still waiting to be scraped. That’s where the check_and_add_scraped_column function comes in.


You can think of this function as a helper that keeps our database organized and ready. It looks into the database—specifically the links table—and checks whether there’s a column called scraped. This column acts like a status tag for each link: if a link hasn’t been scraped yet, its value is 0; once it’s been processed, the value is updated to 1.


The smart part is how this function works. Instead of assuming the column already exists, it inspects the table dynamically. In other words, it checks what columns are actually present in the database at that moment. If the scraped column is missing, the function adds it automatically—no need for you to do anything manually.


By doing this, the script becomes much more flexible and reliable, even if the database structure wasn’t perfectly set up in advance. It adapts on its own and ensures that every hotel link can be marked with a clear status.


This simple check gives us a clean and effective way to track progress, avoid re-scraping the same pages, and pick up right where we left off if the script is ever interrupted.
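
As a practical aside, this status column also makes it easy to peek at progress while a long run is underway—for example, from a separate Python shell:

import sqlite3

conn = sqlite3.connect("booking_links.db")
cursor = conn.cursor()
cursor.execute("SELECT scraped, COUNT(*) FROM links GROUP BY scraped;")
for scraped, count in cursor.fetchall():
    print(f"{'done' if scraped else 'pending'}: {count}")
conn.close()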


`get_unscraped_links(conn)` Function

def get_unscraped_links(conn):
   """
   Retrieve all links that have not yet been scraped from the database.

   This function queries the 'links' table to fetch links where the 'scraped'
   column is set to 0, along with their associated check-in and check-out dates.

   Args:
       conn (sqlite3.Connection): An active database connection.

   Returns:
       list: A list of tuples containing (link_id, url, checkin_date, checkout_date)
             for unscraped links.
   """
   cursor = conn.cursor()
   cursor.execute("SELECT id, url, checkin_date, checkout_date FROM links WHERE scraped = 0;")
   return cursor.fetchall()

Once we’ve marked which links have been scraped and which haven’t, the next step is to pull out only the ones that still need to be processed. That’s exactly what the get_unscraped_links function does.


Think of this function as building a to-do list for the scraper. It looks inside the database, finds all the hotel links that still have a scraped value of 0, and prepares them for the next round of data collection.


But it doesn’t just fetch the link alone—it also brings along the check-in and check-out dates tied to each hotel search. This extra bit of information helps the scraper stay smart about context. For example, it knows not just which hotel page to visit, but also which specific dates were used during the search. That kind of detail can be important when displaying prices or room availability.


The result of this function is a list of tuples—each one holding a hotel link along with its associated dates. This list becomes the input for the asynchronous scraping engine, which means the scraper can begin working through the queue, one hotel at a time, without repeating any links or missing key information.


`mark_as_scraped(conn, link_id)` Function

def mark_as_scraped(conn, link_id):
   """
   Update the status of a specific link to indicate it has been scraped.

   This function sets the 'scraped' column to 1 for the given link ID,
   marking it as processed in the database.

   Args:
       conn (sqlite3.Connection): An active database connection.
       link_id (int): The unique identifier of the link to be marked as scraped.
   """
   cursor = conn.cursor()
   cursor.execute("UPDATE links SET scraped = 1 WHERE id = ?;", (link_id,))
   conn.commit()

Once the scraper finishes gathering data from a hotel page, it needs a way to mark that job as done—and that’s exactly what the mark_as_scraped function does.


This function updates the database to say, “Hey, we’ve already scraped this link.” It does this by setting the scraped column value to 1 for that specific hotel link. That little update may seem simple, but it plays a huge role in keeping the entire scraping process organized and efficient.


Why is this important? Because scraping large websites like Booking.com can take time. Sometimes, the script might stop halfway through due to a network issue, a system reboot, or any number of unexpected reasons. Without this tracking system, you’d either have to start over or risk scraping the same hotel pages again.


But thanks to mark_as_scraped, the scraper knows exactly where it left off. When you run the script again, it will skip over links that have already been processed and continue with the rest—making the scraping workflow resumable and much more reliable.


`save_scraped_data(conn, data)` Function

def save_scraped_data(conn, data):
   """
   Save the scraped hotel information to the database.

   This function creates a 'scraped_data' table if it doesn't exist and inserts
   the scraped information for a specific hotel, including details like title,
   location, description, pricing, and reviews.

   Args:
       conn (sqlite3.Connection): An active database connection.
       data (tuple): A tuple containing scraped hotel information in the order:
           (url, checkin_date, checkout_date, title, location, location_score,
            description, facilities, room_type, price, sale_price,
            tax_amount, rating, rating_score, review_count)
   """
   cursor = conn.cursor()
   cursor.execute("""
       CREATE TABLE IF NOT EXISTS scraped_data (
           id INTEGER PRIMARY KEY,
           url TEXT,
           checkin_date TEXT,
           checkout_date TEXT,
           title TEXT,
           location TEXT,
           location_score TEXT,
           description TEXT,
           facilities TEXT,
           room_type TEXT,
           price TEXT,
           sale_price TEXT,
           tax_amount TEXT,
           rating TEXT,
           rating_score TEXT,
           review_count TEXT
       );
   """)
   cursor.execute("""
       INSERT INTO scraped_data (
           url, checkin_date, checkout_date, title, location, location_score,
           description, facilities, room_type, price, sale_price,
           tax_amount, rating, rating_score, review_count
       ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);
   """, data)
   conn.commit()

Once all the data has been scraped from a hotel page—names, locations, prices, reviews, and more—it needs to be saved in a safe and structured way. That’s where the save_scraped_data function steps in. It’s the final step in our scraping journey, and it plays a key role in making sure nothing gets lost.


This function creates a new table in the database called scraped_data, if it doesn’t already exist. The table is designed to hold all the rich details we collected from each hotel page—things like hotel names, addresses, descriptions, amenities, room types, prices, and user reviews.


But what makes this function convenient is that it sets the table up programmatically: the first time the scraper runs, the table is created automatically, so there’s no manual database setup. And if you later decide to capture an extra field from the hotel pages, this one function is the only place the schema needs to change.


With every successful scrape, a new entry is added to the database, turning raw webpage content into organized, usable information. Over time, this builds a valuable dataset that can be used for data analysis, trend tracking, competitor research, or business decisions.


`parse_title(soup)` Function

def parse_title(soup):
   """
   Extract the hotel title from the BeautifulSoup parsed page.

   Args:
       soup (BeautifulSoup): The parsed HTML content of the hotel page.

   Returns:
       str: The hotel title, or "Not Available" if not found.
   """
   title = soup.select_one("#hp_hotel_name > div > h2")
   return title.get_text(strip=True) if title else "Not Available"

This function is in charge of getting the hotel’s name from the webpage. It works with the parsed HTML content using BeautifulSoup, a tool that helps us navigate and extract data from web pages.


The function looks for the part of the page where the hotel’s name is usually placed—typically inside a specific HTML tag. It does this using a CSS selector, like pointing a finger at the exact spot where the name is expected to be.


But here’s the smart part: websites sometimes change. A hotel name might be missing or placed somewhere unexpected. Instead of crashing or giving an error, this function gracefully handles the situation. If it can’t find the name, it simply returns “Not Available.”


That small fallback keeps the scraper running smoothly, even if the page doesn’t look exactly the same every time. It’s an example of resilient scraping—where the goal is to collect as much data as possible, without breaking when something is missing or slightly different.
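
Because every parser below repeats this same select-or-fallback pattern, you could factor it into one small helper. This isn’t part of the original script—just a sketch of how the pattern generalizes:

def safe_text(soup, selector, default="Not Available"):
    """Return the stripped text of the first element matching selector, or a default."""
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element else default

# parse_title could then be written in one line:
def parse_title(soup):
    return safe_text(soup, "#hp_hotel_name > div > h2")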


Similarly, the parsing functions for location, location score, description, room type, price details, and ratings are given below.


`parse_location(soup)` Function

def parse_location(soup):
   """
   Extract the hotel location from the BeautifulSoup parsed page.

   Args:
       soup (BeautifulSoup): The parsed HTML content of the hotel page.

   Returns:
       str: The hotel location, or "Not Available" if not found.
   """
   location = soup.select_one("#wrap-hotelpage-top > div:nth-child(4) > div > div > span:nth-child(2) > div")
   return location.get_text(strip=True) if location else "Not Available"

`parse_loc_score(soup)` Function

def parse_loc_score(soup):
   """
   Extract the location score from the BeautifulSoup parsed page.

   Args:
       soup (BeautifulSoup): The parsed HTML content of the hotel page.

   Returns:
       str: The location score, or "not available" if not found.
   """
   loc_score=soup.select_one("#reviewFloater > div.best-review-score.best-review-score-with_best_ugc_highlight.hp_lightbox_score_block > span > span")
   return loc_score.get_text(strip=True) if loc_score else "not available"

`parse_description(soup)` Function

def parse_description(soup):
   """
   Extract the hotel description from the BeautifulSoup parsed page.

   Args:
       soup (BeautifulSoup): The parsed HTML content of the hotel page.

   Returns:
       str: The hotel description, or "Not Available" if not found.
   """
   description = soup.select_one("#basiclayout > div.hotelchars > div.page-section.hp--desc_highlights.js-k2-hp--block > div > div.bui-grid__column.bui-grid__column-8.k2-hp--description > div.hp-description > div.hp_desc_main_content > div > div > p.a53cbfa6de.b3efd73f69")
   return description.get_text(strip=True) if description else "Not Available"

`parse_roomtype(soup)` Function

def parse_roomtype(soup):
   """
   Extract the room type from the BeautifulSoup parsed page.

   Args:
       soup (BeautifulSoup): The parsed HTML content of the hotel page.

   Returns:
       str: The room type, or "not available" if not found.
   """
   room=soup.select_one("a.hprt-roomtype-link > span.hprt-roomtype-icon-link ")
   return room.get_text(strip=True) if room else "not available"

`parse_prices(soup)` Function

def parse_prices(soup):
   """
   Extract price-related information from the BeautifulSoup parsed page.

   Args:
       soup (BeautifulSoup): The parsed HTML content of the hotel page.

   Returns:
       tuple: A tuple containing (original_price, sale_price, tax_amount),
           with "not available" used if any information is missing.
   """
   price = soup.select_one("div.bui-f-color-destructive.js-strikethrough-price.prco-inline-block-maker-helper.bui-price-display__original")
   sale_price = soup.select_one("span.prco-valign-middle-helper")
   tax_amount = soup.select_one("div.prd-taxes-and-fees-under-price")
   return (
       price.get_text(strip=True) if price else "not available",
       sale_price.get_text(strip=True) if sale_price else "not available",
       tax_amount.get_text(strip=True) if tax_amount else "not available"
   )

`parse_ratings(soup)` Function

def parse_ratings(soup):
   """
   Extract rating-related information from the BeautifulSoup parsed page.

   Args:
       soup (BeautifulSoup): The parsed HTML content of the hotel page.

   Returns:
       tuple: A tuple containing (rating, rating_score, review_count),
              with "not available" used if any information is missing.
   """
   rating = soup.select_one("span.a3b8729ab1.e6208ee469.cb2cbb3ccb")
   rating_score = soup.select_one("div.a3b8729ab1.d86cee9b25")
   review_count = soup.select_one("span.a3b8729ab1.f45d8e4c32.d935416c47")
   return (
       rating.get_text(strip=True) if rating else "not available",
       rating_score.get_text(strip=True) if rating_score else "not available",
       review_count.get_text(strip=True) if review_count else "not available"
   )

`parse_facilities(soup)` Function

def parse_facilities(soup):
   """
   Extract hotel facilities from the BeautifulSoup parsed page.

   Args:
       soup (BeautifulSoup): The parsed HTML content of the hotel page.

   Returns:
       str: A comma-separated string of hotel facilities, or an empty string if none found.
   """
   facilities = []
   items = soup.select("#basiclayout > div.hotelchars > div.page-section.hp--desc_highlights.js-k2-hp--block > div > div.bui-grid__column.bui-grid__column-8.k2-hp--description > div.hp--popular_facilities.js-k2-hp--block > div:nth-child(2) > div > div > ul > li")
   for item in items:
       facilities.append(item.get_text(strip=True))
   return ", ".join(facilities)

This function is responsible for gathering the list of facilities or amenities that a hotel offers—things like Wi-Fi, gym access, parking, or air conditioning.


It works by going through the HTML content of the page and looking for the list items that usually contain each facility. Using BeautifulSoup, it loops through each of these items and pulls out the text—one by one.


Once all the facilities are collected, the function joins them into a single string, separating each item with a comma. This makes the final result easy to read and store—like a quick summary of everything the hotel provides.


What’s especially useful about this approach is that it can handle different facility setups. Whether a hotel lists five items or fifteen, the function adjusts accordingly. It turns a messy block of HTML into a clean, readable string that tells you what to expect from your stay.


`fetch_page_content(url, page)` Function

async def fetch_page_content(url, page):
   """
   Fetch the HTML content of a webpage using Playwright.

   This async function navigates to the specified URL, scrolls to the bottom
   of the page to trigger any lazy-loaded content, and waits for a short time
   to ensure page rendering is complete.

   Args:
       url (str): The URL of the webpage to scrape.
       page (playwright.async_api.Page): An active Playwright page instance.

   Returns:
       str: The fully rendered HTML content of the page.
   """
   await page.goto(url)
   await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
   await page.wait_for_timeout(12000)
   return await page.content()

The fetch_page_content function is like the scraper’s eyes and patience—it doesn’t just open a webpage, it makes sure everything is fully loaded and ready before collecting any data.


Modern websites like Booking.com don’t load all their content in one go. Instead, they load things as you scroll or as needed, using JavaScript. That’s why this function uses Playwright, a powerful tool that controls a real browser, to visit the hotel’s page just like a human would.


Once on the page, the function scrolls to the bottom and then waits. This is a smart trick. The act of scrolling triggers the website to load more information—like photos, room types, or reviews—that wouldn’t appear right away. The wait gives it enough time to finish loading all that data.


By the time the function captures the page content, it’s not just a half-loaded snapshot—it’s the fully expanded version of the hotel page, with everything visible and ready to be parsed. This way, the scraper doesn’t miss out on anything important that might have been hidden during the first few seconds of loading.
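
One caveat: the fixed 12-second pause is simple, but it wastes time on fast pages and can fall short on slow ones. A possible refinement—sketched here, not part of the original script—is to wait for a known element instead, such as the hotel name container used by parse_title:

async def fetch_page_content(url, page):
    await page.goto(url)
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    try:
        # Wait up to 15 seconds for the hotel name block instead of a fixed delay
        await page.wait_for_selector("#hp_hotel_name", timeout=15000)
    except Exception:
        pass  # element never appeared; return whatever has loaded so far
    return await page.content()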


`parse_content(content)` Function

def parse_content(content):
   """
   Parse the HTML content and extract all relevant hotel information.

   This function uses BeautifulSoup to parse the HTML and calls multiple
   parsing functions to extract different pieces of hotel information.

   Args:
       content (str): The HTML content of the hotel page.

   Returns:
       tuple: A tuple containing extracted hotel details in the order:
           (title, location, location_score, description, facilities, room_type,
            price, sale_price, tax_amount, rating, rating_score, review_count)
   """
   soup = BeautifulSoup(content, 'html.parser')
   title = parse_title(soup)
   location = parse_location(soup)
   location_score = parse_loc_score(soup)
   description = parse_description(soup)
   facilities = parse_facilities(soup)
   room_type = parse_roomtype(soup)
   price, sale_price, tax_amount = parse_prices(soup)
   rating, rating_score, review_count = parse_ratings(soup)
   return (title, location, location_score, description, facilities, room_type,
           price, sale_price, tax_amount, rating, rating_score, review_count)

At the heart of the Booking.com Hotel Scraper is the parse_content() function. Think of this as the brain of the operation—the place where raw, messy webpage content gets transformed into clean, structured data you can actually use.


Here’s how it works: once the scraper fetches the HTML of a hotel’s page, parse_content() steps in to make sense of it. It uses BeautifulSoup, a library that helps us dig into the webpage and pick out just the parts we need—like the hotel’s name, location, room details, prices, and more.


But instead of trying to extract everything at once, the function calls on a team of helper functions—each designed to grab one specific piece of data. For example, parse_title() gets the hotel’s name, parse_location() finds the address, and other functions handle things like facilities, reviews, and pricing. Each of these helpers knows exactly where to look and what to collect.


Once all the pieces are gathered, parse_content() combines them into a single, organized bundle—what we call a tuple. This tuple represents a complete snapshot of the hotel’s information and is ready to be saved into the database.


What makes this function especially powerful is its flexibility and resilience. If a hotel page is missing something—say the location score or the description—the function won’t break. Instead, it fills in a placeholder like "Not Available", so the scraper keeps running smoothly. This way, even if some hotel pages are a bit messy or incomplete, the scraper still pulls out as much information as possible.


In the end, parse_content() ensures that every hotel we scrape ends up as a well-structured, information-rich record. And since each part of the function is modular, it’s easy to update or improve one piece without affecting the whole workflow.


`scrape_links(db_name)` Function

async def scrape_links(db_name):
   """
   Main asynchronous function to scrape hotel links from the database.

   This function performs the following steps:
   1. Connect to the database
   2. Ensure the 'scraped' column exists
   3. Retrieve unscraped links
   4. Launch a Playwright browser
   5. Iterate through links, scraping and saving data
   6. Mark each link as scraped upon successful scraping

   Args:
       db_name (str): The name of the SQLite database containing links to scrape.
   """
   conn = connect_to_database(db_name)
   check_and_add_scraped_column(conn)
   unscraped_links = get_unscraped_links(conn)

   async with async_playwright() as playwright:
       browser = await playwright.chromium.launch(headless=False)
       context = await browser.new_context()
       page = await context.new_page()

       for link_id, url, checkin_date, checkout_date in unscraped_links:
           try:
               print(f"Scraping: {url}")
               content = await fetch_page_content(url, page)
               title, location, location_score, description, facilities, room_type, price, sale_price, tax_amount, rating, rating_score, review_count = parse_content(content)
              
               # Include checkin and checkout dates in the data to be saved
               save_scraped_data(conn, (
                   url, checkin_date, checkout_date, title, location, location_score,
                   description, facilities, room_type, price, sale_price,
                   tax_amount, rating, rating_score, review_count
               ))
              
               mark_as_scraped(conn, link_id)
               print(f"Successfully scraped: {url}")
           except Exception as e:
               print(f"Error scraping {url}: {e}")

       await browser.close()

   conn.close()

The scrape_links() function is like the control center of the entire scraping operation. It pulls everything together—connecting to the database, opening the browser, getting the links, and guiding the scraper through each hotel page, step by step.


This function is asynchronous, which means it can handle tasks in a non-blocking way. That’s useful for scraping because hotel pages can take time to load, and the scraper needs to wait for them without getting stuck.


Here’s what it does: First, it connects to the SQLite database, which holds all the hotel links we plan to scrape. Then, it filters out the ones we've already processed and fetches only the unscraped ones. These are the fresh pages we still need to explore.


After gathering the list, the function launches a Playwright browser—a tool that lets our script browse the web just like a person would. Then, for each hotel link, it loads the page, extracts the data, saves the result, and finally updates the database to mark that hotel as "scraped."


What’s impressive about this function is how it handles errors gracefully. Websites don’t always behave the way we expect—some links may not load, others may be missing data, or the internet connection could drop. If something goes wrong with one hotel page, scrape_links() doesn’t give up or crash. Instead, it logs the issue and moves on to the next one.


This thoughtful design makes the scraper strong and reliable, ready to work through hundreds of hotel pages without breaking. It strikes the right balance between being thorough (by checking every hotel) and being resilient (by continuing even when a few pages have problems). In short, scrape_links() keeps the entire process moving smoothly and efficiently.
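
To illustrate, here’s a rough sketch (not part of the original script) of how the same async building blocks could process several hotel pages at once with asyncio.gather. It reuses the fetch_page_content and parse_content functions defined above—and keep the concurrency low, or Booking.com’s rate limiting will kick in:

import asyncio
from playwright.async_api import async_playwright

async def scrape_batch(urls, concurrency=3):
    """Scrape several hotel pages concurrently; returns (url, parsed_tuple) pairs."""
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=False)
        semaphore = asyncio.Semaphore(concurrency)  # cap simultaneous pages

        async def worker(url):
            async with semaphore:
                page = await browser.new_page()
                try:
                    content = await fetch_page_content(url, page)
                    return url, parse_content(content)
                finally:
                    await page.close()

        results = await asyncio.gather(*(worker(u) for u in urls))
        await browser.close()
        return results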


Execution Flow

# Run the script
if __name__ == "__main__":
   asyncio.run(scrape_links("booking_links.db"))

The script is designed to be run directly, with the if __name__ == "__main__": block triggering the asynchronous scraping process on the specified database of hotel links.


Conclusion


Throughout this project, we saw how web scraping, when done with a clear plan and the right tools, can turn messy website content into clean, usable data. By collecting hotel details from Booking.com and storing everything neatly in a SQLite database, we transformed scattered web pages into structured, meaningful information ready for analysis.


This journey shows that web scraping doesn’t have to be complicated. With some practice and the right mindset, anyone can learn to gather valuable data from websites. Projects like this are a great way to build confidence, sharpen your skills, and see real results. Over time, you’ll find yourself able to take on more advanced scraping tasks, build your own tools, or even feed this data into larger projects whether it’s for research, analytics, or app development.


Contact Datahut for all your web scraping needs!


AUTHOR


I’m Shahana, a Data Engineer at Datahut, where I help turn complex, unstructured web data into organized, valuable insights—especially for travel, e-commerce, and hospitality sectors.


At Datahut, we’ve worked with businesses around the globe to automate data collection from websites that are often difficult to scrape, including those that use JavaScript and dynamic loading. In this blog, I walked you through a real-world project where we scraped hotel listings from Booking.com using Playwright, BeautifulSoup, and SQLite. The goal was to collect reliable hotel information for San Francisco over a range of dates, handling dynamic content, pagination, and data storage, all in a way that’s scalable and beginner-friendly.


If your team is looking to collect structured travel or accommodation data at scale—or simply wants to learn how scraping can support smarter market insights—feel free to reach out through the chat widget on the right. We’re always happy to explore solutions that fit your data goals.


Frequently Asked Questions (FAQ)


  1. Is it legal to scrape data from Booking.com?

    Scraping publicly available data from Booking.com may violate their terms of service, even if it is not always illegal. It’s important to review their robots.txt file and legal terms before scraping.


  2. What type of data can I extract from Booking.com?

    You can extract hotel names, prices, locations, reviews, ratings, availability, room types, and amenities — depending on your scraping setup and compliance approach.


  3. Which tools are best for scraping Booking.com?

    Tools like Playwright, Selenium, BeautifulSoup, and Scrapy are often used. Browser-automation tools such as Playwright or Selenium are ideal when dynamic (JavaScript-rendered) content needs to be handled.


  4. Does Booking.com block web scrapers?

    Yes, Booking.com actively uses anti-bot mechanisms such as CAPTCHA, rate limiting, and IP blocking. You need to rotate proxies, use user-agents, and throttle requests.


  5. Can I scrape Booking.com without coding knowledge?

    While DIY scraping requires coding, you can use scraping service providers or no-code tools — but always ensure compliance with Booking.com’s terms.

Do you want to offload the dull, complex, and labour-intensive web scraping task to an expert?
