Web scraping with Python: Scraping Property Data from a Real Estate Website

Updated: Aug 22, 2022

If you're in the market for a new apartment and want to see which properties are available, there's a good chance that you've already searched for them online. But even if you have found what looks like your dream home, what if you'd like to see more details about it?

If you do this by manually searching on the website, it will take you forever to find every detail about a property.

That's where web scraping comes in. Web scraping is the process of extracting data from websites and converting it into structured formats such as CSV or JSON files. It can be used for many different purposes, such as analyzing market conditions, understanding what apartments are available at what price points, and much more.

In this post, we'll see how to scrape data from a real estate website. The goal is to extract property data for Amsterdam using Python. The source is Pararius.com, one of Amsterdam's most popular websites for renting and selling property. We'll use Python to scrape the real estate data from this website and save it as a CSV file. Then, we can analyze the data using Excel or another program.

THE PACKAGES

In this tutorial, we will use Python to extract the data. We first import BeautifulSoup, requests, the csv writer, the time and random modules, and lxml's element tree (etree, imported as et). These are the packages needed to fetch an HTML page and extract data from it.

from bs4 import BeautifulSoup
import requests
from csv import writer
import time
import random
from lxml import etree as et

We first import BeautifulSoup. It is a Python package that parses HTML and XML documents so that data can be pulled out of them. More about BeautifulSoup below.

The first step is to send an HTTP request to the website. For this, we need the requests module, one of the most widely used HTTP libraries in Python. Many web scraping setups use requests either as it is or build request handlers on top of it.

Once the data is acquired, we need to write it to a CSV file. This can be done using the writer function from the csv module. Pandas is another popular choice for the same goal, but we stick with the csv writer.

Between requests, we need to halt execution for short periods of time. If we flood the target website with requests, the server can become overloaded, which may disrupt the service. As a responsible web scraping service, we need to avoid this. That's why we leave a time gap between HTTP requests, using the time module.

Finally, we need the random module to generate a random duration for each pause, and the element tree (etree) for evaluating XPath expressions. If the gaps between requests are always the same, the regular pattern could be used to identify the web scraper. So we randomize the gaps using Python's random package.
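The randomized pause described above can be sketched as a small helper. The name polite_pause and the default bounds of 1 to 3 seconds are assumptions chosen for illustration; they are not values taken from the original scraper.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random duration between the two bounds and return it."""
    # random.uniform picks a float anywhere between the two bounds.
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

Calling polite_pause() after each request spaces the requests out unevenly, which is the behaviour the text is aiming for.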

header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.66 Safari/537.36"} 
base_url= "https://www.pararius.com/apartments/amsterdam/page-" 
pages_url=[]    
listing_url=[]

Many websites dislike being scraped and block requests that do not come with a browser-like User-Agent. So we use headers to add a user agent to our requests. The user agent is sent to the server as one of the request headers. In this case, a user agent alone is enough.

The base_url is the prefix used to build the page_url of each results page. pages_url is an empty list for storing the links to each page. listing_url is also an empty list, used to store the link to each apartment on each page.
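One way to check which headers a request would carry, without sending anything, is to prepare it through a requests.Session. This is only a sketch: the tutorial itself passes headers=header to requests.get(), and the use of a Session here is an assumption made for illustration.

```python
import requests

header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.66 Safari/537.36"}

# A Session sends the same default headers with every request made through it.
session = requests.Session()
session.headers.update(header)

# Prepare (but do not send) a request to inspect the headers it would carry.
request = requests.Request("GET", "https://www.pararius.com/apartments/amsterdam/page-1")
prepared = session.prepare_request(request)
print(prepared.headers["User-Agent"])
```

The printed value should be the browser-like string from header rather than the default user agent that requests would otherwise send.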

for i in range (1,23): 
    page_url=base_url + str(i)
    pages_url.append(page_url)

We need the URL of every results page, and we know the site has 22 pages. So we create a for loop that builds each page_url by appending the numbers 1 to 22, converted to strings, to base_url. Note that range(1,23) stops before 23. We store each of these URLs in the list pages_url.
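The same list can also be built in a single expression. The sketch below is an equivalent alternative to the loop, not the code used in the tutorial, and it prints a few values to confirm the range covers pages 1 to 22.

```python
base_url = "https://www.pararius.com/apartments/amsterdam/page-"

# range(1, 23) yields 1 through 22; the upper bound is exclusive.
pages_url = [base_url + str(i) for i in range(1, 23)]

print(len(pages_url))
print(pages_url[0])
print(pages_url[-1])
```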

def get_dom(the_url):
    response = requests.get(the_url, headers=header)
    soup = BeautifulSoup(response.text,'lxml')
    dom = et.HTML(str(soup))
    return dom

The get_dom() function returns the DOM for a URL. We pass the_url as an argument and store the result of the request in response. We then create the soup variable from the response text and build the dom from the soup. The function returns the dom.

def get_listing_url(page_url):
    dom = get_dom(page_url)
    page_link_list=dom.xpath('//a[@class="listing-search-item__link listing-search-item__link--title"]/@href')
    for page_link in page_link_list:
        listing_url.append("https://www.pararius.com"+page_link)

The get_listing_url() function retrieves the listing_url of each apartment's detail page.

We get the dom from the get_dom() function by passing the page_url of each page. We then query it using the XPath

//a[@class="listing-search-item__link listing-search-item__link--title"]/@href

To find an XPath, inspect the element in the browser's developer tools, then press Ctrl + F to search the page source. We specify the tag and then the class name. If we want the text, we append text(), and if we need the link, we append @href.

This gives us the link to each apartment's details. The retrieved links are stored in the list page_link_list. Because they are relative links, we prepend “https://www.pararius.com” to each page_link in page_link_list and append the result to listing_url.
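The difference between @href and text() is easier to see on a small piece of markup. The HTML below is a made-up fragment that only mimics one search result (the href value is hypothetical); the real page is more complex.

```python
from lxml import etree as et

# Hypothetical markup that mimics one search result.
html = """
<html><body>
  <a class="listing-search-item__link listing-search-item__link--title"
     href="/apartment-for-rent/amsterdam/example-id/example-street">
    Apartment Example Street
  </a>
</body></html>
"""

dom = et.HTML(html)
link_xpath = '//a[@class="listing-search-item__link listing-search-item__link--title"]'

hrefs = dom.xpath(link_xpath + "/@href")    # attribute values
texts = dom.xpath(link_xpath + "/text()")   # text nodes, whitespace included

# The hrefs are relative, so the site root is prepended to each one.
full_links = ["https://www.pararius.com" + href for href in hrefs]
print(full_links)
print([text.strip() for text in texts])
```

Both queries return lists, even when only one element matches, which is why the scraper indexes or loops over the results.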

for page_url in pages_url:
    get_listing_url(page_url)
    # pause for a random interval between requests; the 1 to 3 second bounds are an assumption
    time.sleep(random.randint(1, 3))

For each page_url in the list pages_url, we call get_listing_url(), passing page_url as an argument. We also call the sleep function to pause execution for a random amount of time between pages.

The attributes and their xpaths used are listed below.

Attribute	xpath
page_link_list	//a[@class="listing-search-item__link listing-search-item__link--title"]/@href
title	//h1[@class='listing-detail-summary__title']/text()
location	//div[@class='listing-detail-summary__location']/text()
price	//div[@class='listing-detail-summary__price']/text()
area	//li[@class='illustrated-features__item illustrated-features__item--surface-area']/text()
rooms	//li[@class='illustrated-features__item illustrated-features__item--number-of-rooms']/text()
interior	//li[@class='illustrated-features__item illustrated-features__item--interior']/text()
description	//div[@class='listing-detail-description__additional listing-detail-description__additional--collapsed']/p/text()
offered_since	//dd[@class='listing-features__description listing-features__description--offered_since']/span/text()
availability	//dd[@class='listing-features__description listing-features__description--acceptance']/span/text()
specifics	//dd[@class='listing-features__description listing-features__description--specifics']/span/text()
upkeep	//dd[@class='listing-features__description listing-features__description--upkeep']/span/text()
volume	//dd[@class='listing-features__description listing-features__description--volume']/span/text()
type_of_building	//dd[@class='listing-features__description listing-features__description--dwelling_type']/span/text()
construction_type	//dd[@class='listing-features__description listing-features__description--construction_type']/span/text()
constructed_year	//dd[@class='listing-features__description listing-features__description--construction_period']/span/text()
location_type	//dd[@class='listing-features__description listing-features__description--situations']/span/text()
bedrooms	//dd[@class='listing-features__description listing-features__description--number_of_bedrooms']/span/text()
bathrooms	//dd[@class='listing-features__description listing-features__description--number_of_bathrooms']/span/text()
no_floors	//dd[@class='listing-features__description listing-features__description--number_of_floors']/span/text()
balcony	//dd[@class='listing-features__description listing-features__description--balcony']/span/text()
garden	//dd[@class='listing-features__description listing-features__description--garden']/span/text()
is_storage_present	//dd[@class='listing-features__description listing-features__description--storage']/span/text()
storage_description	//dd[@class='listing-features__description listing-features__description--description']/span/text()
is_garage_present	//dd[@class='listing-features__description listing-features__description--available']/span/text()
contact_details	//a[@class='agent-summary__agent-contact-request']/@href

with open('apartments.csv','w',newline='') as f:
    thewriter=writer(f)
    heading=['URL','TITLE','LOCATION','PRICE PER MONTH','AREA IN m²','NUMBER OF ROOMS','INTERIOR','DESCRIPTION','OFFERED SINCE','AVAILABILITY','SPECIFICATION','UPKEEP STATUS','VOLUME','TYPE','CONSTRUCTION TYPE','CONSTRUCTION YEAR','LOCATION TYPE','NUMBER OF BEDROOMS','NUMBER OF BATHROOMS','NUMBER OF FLOORS','DETAILS OF BALCONY','DETAILS OF GARDEN','DETAILS OF STORAGE','DESCRIPTION OF STORAGE','GARAGE','CONTACT DETAILS']
    thewriter.writerow(heading)

We open the apartments.csv file in write mode as f and create thewriter to write data into the file. We then create a list, heading, that holds the column headings and write it as the first row.
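To see what the writer produces, the sketch below writes a shortened heading row and one made-up data row to an in-memory buffer instead of a file. The row values are invented for illustration.

```python
import csv
import io

# An in-memory buffer stands in for the file so the example leaves nothing on disk.
buffer = io.StringIO()
thewriter = csv.writer(buffer)

# A shortened heading row and one made-up data row.
thewriter.writerow(["URL", "TITLE", "DESCRIPTION"])
thewriter.writerow(["https://www.pararius.com/example", "Apartment Example Street", "Bright, furnished apartment"])

lines = buffer.getvalue().splitlines()
print(lines[0])
print(lines[1])
```

The description contains a comma, so the writer wraps that field in double quotes. This is why free-text fields such as descriptions can be stored safely in a CSV file.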

    # This loop continues inside the with block above, so the file is still open.
    for list_url in listing_url:
        listing_dom=get_dom(list_url)
        title=get_title(listing_dom)
        location=get_location(listing_dom)
        # ............. (the remaining calls are elided here)
        information =[list_url,title,location,price,area,rooms,interior,description,offer,availability,specification,upkeep_status,volume,type,construction_type,constructed_year,location_type,bedrooms,bathrooms,floors,balcony_details,garden_details,storage_details,storage_description,garage,contact]
        thewriter.writerow(information)

We get the listing_dom for each list_url in listing_url by calling the get_dom() function and passing list_url as an argument. We get the title of the apartment by calling the get_title() function and the location by calling the get_location() function.

def get_title(dom):    
    try:               
        title=dom.xpath("//h1[@class='listing-detail-summary__title']/text()")[0][10:-13]  
        print(title)
    except Exception as e:
        title = "Title is not available"
        print(title)
    return title

The get_title() function takes the dom as an argument. We use a try-except block so that a missing element does not stop the scraper. In the try block, we extract the title using the XPath. An XPath query returns a list of matches, but we need only the first element, which is why we index with [0]. We then slice the string with [10:-13], which drops the first 10 and the last 13 characters of the raw string and keeps the text in between.

We also have the except block so that a title we couldn't extract using the XPath is stored as “Title is not available”. We return the string title.

Similarly, we have the following functions, which differ from one another only in the XPath they use:
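The effect of the [10:-13] slice can be checked on a sample string. The raw value below is an assumption: it simply pads a made-up title with 10 characters in front and 13 behind. The strip() call is shown as a possible alternative, not as the approach used in the tutorial.

```python
# Assumed raw value: 10 characters of padding, the title, then 13 more characters.
raw_title = "\n" + " " * 9 + "Apartment Example Street" + "\n" + " " * 12

sliced = raw_title[10:-13]     # fixed offsets, as in get_title()
stripped = raw_title.strip()   # alternative that does not depend on the padding length

print(repr(sliced))
print(repr(stripped))
```

Fixed offsets only work while the amount of surrounding text stays the same. If the padding is whitespace, strip() keeps working even when its length changes.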

get_location()
get_price()
get_area()
get_rooms()
get_interior()
get_description()
offered_since()
get_specification()
get_upkeep_status()
get_volume()
get_type()
get_construction_type()
get_construction_year()
get_location_type()
get_bedrooms()
get_bathrooms()
get_no_floors()
get_balcony_details()
get_garden_details()
get_storage_details()
get_storage_description()
is_garage_present()
get_contact_details()
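Since these functions follow the same pattern, the shared part could be factored into one helper. The sketch below is a hypothetical refactoring, not the code from the tutorial: get_first_text and its default value are invented names, the HTML is a made-up fragment, and strip() replaces the fixed slices used in get_title().

```python
from lxml import etree as et


def get_first_text(dom, xpath, default="Not available"):
    """Return the first XPath match with surrounding whitespace removed, or a default."""
    matches = dom.xpath(xpath)
    if not matches:
        return default
    return matches[0].strip()


# Hypothetical detail-page fragment using two class names from the XPath table.
html = """
<html><body>
  <h1 class="listing-detail-summary__title">  Apartment Example Street  </h1>
  <div class="listing-detail-summary__location">1011 AB Amsterdam</div>
</body></html>
"""
dom = et.HTML(html)

title = get_first_text(dom, "//h1[@class='listing-detail-summary__title']/text()")
location = get_first_text(dom, "//div[@class='listing-detail-summary__location']/text()")
# The fragment has no price element, so the default is returned.
price = get_first_text(dom, "//div[@class='listing-detail-summary__price']/text()")
print(title, location, price, sep=" | ")
```

Checking for an empty result replaces the try-except block for the common case of a missing element.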

The information list passed to the writer holds the values written to each row of the CSV file. The list contains:

list_url - The URL of each apartment’s detail page. The get_listing_url() function builds the list of links to the apartments on a given results page.
title - The title of the apartment, returned by the get_title() function and stored in the title variable.
location - The location of the apartment, returned by the get_location() function and stored in the location variable.
price - The get_price() function returns the price in € of each apartment.
area - The get_area() function returns the area in m² of each apartment.
rooms - The get_rooms() function returns the number of rooms in each apartment. It is stored in the rooms variable.
interior - The get_interior() function returns the interior status of the apartment: furnished, semi-furnished, or upholstered.
description - A short description of the apartment, returned by the get_description() function and stored in the description variable.
offer - The offered_since() function returns the date since which the apartment has been offered. The returned value is stored in the offer variable.
specification - The get_specification() function returns the specification of each apartment. The result is stored in the specification variable.
upkeep_status - The get_upkeep_status() function returns the upkeep status of the apartment. The result is stored in the upkeep_status variable.
volume - The get_volume() function returns the volume of each apartment. The returned value is stored in the volume variable.
type - The get_type() function returns the type of the apartment, such as a multi-family house or a flat.
construction_type - The get_construction_type() function returns whether the house is an existing building or a new build.
constructed_year - The get_construction_year() function returns the year in which construction of the apartment was completed.
location_type - The get_location_type() function returns the details of the location of the apartment. The result is stored in the location_type variable.
bedrooms - The get_bedrooms() function returns the number of bedrooms in each apartment. It is stored in the bedrooms variable.
bathrooms - The get_bathrooms() function returns the number of bathrooms in each apartment. It is stored in the bathrooms variable.
floors - The get_no_floors() function returns the number of floors in each apartment. It is stored in the floors variable.
balcony_details - The get_balcony_details() function returns whether the apartment has a balcony. The result is stored in the balcony_details variable.
garden_details - The get_garden_details() function returns whether the apartment has a garden. The result is stored in the garden_details variable.
storage_details - The get_storage_details() function returns whether the apartment has a storage area. The result is stored in the storage_details variable.
storage_description - The get_storage_description() function returns the details of the storage area of the apartment. The result is stored in the storage_description variable.
garage - The is_garage_present() function returns whether the apartment has a garage.
contact - The get_contact_details() function returns the contact details of the agent offering the apartment.

We write the details of each apartment to the CSV file. The Python notebook and the resulting CSV file are given below.

Conclusion

We hope you've enjoyed this tutorial and that it's inspired you to start scraping your own data. Our next post will analyze this property data and show you how to visualize property types, prices, and other attributes.

Looking for a reliable web scraping solution for your web data needs? Contact Datahut today!
