tony56024

Mastering XPath for Web Scraping: A Step-by-Step Tutorial

Updated: Jan 22, 2023



XPath is an essential tool for web scraping: it lets you select specific elements from an HTML document for extraction. To keep your XPaths accurate, efficient, and robust, it helps to follow a few best practices. This tutorial covers those practices, along with examples.


XPath is just one of many tools that can be used for web scraping. It is handy when the structure of the HTML document is complex or when you want to extract data from specific elements that are difficult to select using other methods. CSS Selectors are an alternative way to locate elements on a page. We will write about using CSS Selectors in a later blog and compare them against XPaths.

The Document Object Model (DOM)


The Document Object Model (DOM) is a programming interface for HTML and XML documents. It represents the structure of a document as a tree of objects, with each object representing a part of the document (such as an element, an attribute, or a piece of text).


In the DOM, each object in the tree is called a node. Each node can have child nodes, which are contained within it, and a parent node, which is the node that contains it. For example, in an HTML document, an element node such as a div element is the parent node of any element nodes or text nodes that are contained directly within it.


The DOM allows you to access and modify the content and structure of a document using programming languages like JavaScript. For example, you can use the DOM to add or remove elements from an HTML document or to change the values of attributes on elements.


By representing the structure of a document as a tree of objects, the DOM provides a way to manipulate the content and structure of a document consistently and logically.
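The parent and child relationships described above can be inspected from Python as well. This is a minimal sketch using lxml; the one-line HTML string and the id "box" are made up for illustration.

```python
from lxml import html

# A tiny, hypothetical document used only to show the tree structure.
doc = html.fromstring(
    "<html><body><div id='box'><p>Hello</p><p>World</p></div></body></html>"
)

div = doc.xpath("//div")[0]

parent_tag = div.getparent().tag          # the node that contains the div
child_tags = [child.tag for child in div]  # the element nodes inside the div
first_text = div[0].text                   # the text held by the first child

print(parent_tag, child_tags, first_text)
```

Every XPath in the rest of this tutorial is a way of walking this same tree.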


What are Xpaths


We'll begin this discussion with a brief overview of Xpath. Xpath, which stands for XML Path Language, is a query language that allows you to select nodes from an XML document. XML, short for Extensible Markup Language, is similar to HTML but has distinct characteristics.


Many programming languages support XPath expressions, which makes XPath a helpful skill to have in your arsenal. Think of it like an SQL query but for XML and HTML documents. XPaths are an excellent tool for RPA (robotic process automation) as well. For example, if you need to fill in a password field, you can use an XPath to locate it.


Types of Xpaths - with Example code


There are two types of XPaths: relative and absolute. Relative XPaths start with a double forward slash (//), which lets the match begin anywhere in the document instead of at the root. They describe an element by its tag, its attributes, and its position relative to nearby elements. For example, if we have the following HTML:

<html>
  <body>
    <div>
      <p>The First paragraph</p>
      <p>The Second paragraph</p>
      <ul>
        <li> First item</li>
        <li> Second item</li>
      </ul>
    </div>
  </body>
</html>

To select the first paragraph element using a relative Xpath, we could use the following Xpath:

//div/p[1]

This XPath selects the first p child of the div element, wherever that div sits in the document.

On the other hand, absolute Xpaths are defined with respect to the root element of the HTML document and start with a forward slash (/). So, for example, to select the first paragraph element using an absolute Xpath, we could use the following Xpath:

/html/body/div/p[1] 

This Xpath selects the first p element, a child of the div element, starting from the root HTML element.


One key difference between relative and absolute Xpaths is that relative Xpaths are more flexible and less likely to break if the structure of the webpage changes. Absolute Xpaths, on the other hand, are more fragile as they are tied to the specific structure of the webpage and are more likely to break if the structure changes. Therefore, it's generally a good idea to use relative Xpaths whenever possible, as they are more robust and easier to maintain.
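To see the difference in practice, here is a small sketch with lxml. The "redesigned" page is an assumption for illustration: it has the same content as the example above, but the div is wrapped in an extra div with a made-up class of "wrapper".

```python
from lxml import html

original = html.fromstring("""
<html><body>
  <div><p>The First paragraph</p><p>The Second paragraph</p></div>
</body></html>
""")

# Hypothetical redesign: the same div, now nested inside a wrapper div.
redesigned = html.fromstring("""
<html><body>
  <div class="wrapper">
    <div><p>The First paragraph</p><p>The Second paragraph</p></div>
  </div>
</body></html>
""")

absolute = "/html/body/div/p[1]"
relative = "//div/p[1]"

results = {}
for name, doc in [("original", original), ("redesigned", redesigned)]:
    results[name] = {
        "absolute": [p.text for p in doc.xpath(absolute)],
        "relative": [p.text for p in doc.xpath(relative)],
    }
    print(name, results[name])
```

On the original page both expressions find the first paragraph. On the redesigned page the absolute XPath returns an empty list, while the relative one still finds it.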


We will be using relative XPaths throughout this tutorial. I encourage you to copy the HTML snippets on this blog into a text editor, save them as an HTML file, and open the file in a browser. Then right-click, choose Inspect, press Ctrl + F in the developer tools, and type your XPath into the search box to test it. Nothing beats learning by doing.


A crash course on Xpaths for web scraping


1. Select the element using the element name


Use your browser's "inspect element" feature to inspect the HTML structure of the page you want to scrape. This will help you understand the hierarchy of elements and how they are nested, making it easier to write accurate XPaths.


Using simple XPaths can help you find specific elements in an HTML document quickly and easily without using any advanced XPath features. This can be especially useful if you are new to XPath or working with a simple HTML structure that does not require more complex XPath expressions.


For example, given the following HTML:

<html>
  <body>
    <div class="extract_content">
      <p>The paragraph you need to extract</p>
    </div>
  </body>
</html>

If you wanted to select the p element, you could use the following XPath:

//div[@class='extract_content']/p
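If you would rather check this from Python than from the browser, the following sketch runs the same XPath with lxml against the HTML above. The variable names are only illustrative.

```python
from lxml import html

doc = html.fromstring("""
<html>
  <body>
    <div class="extract_content">
      <p>The paragraph you need to extract</p>
    </div>
  </body>
</html>
""")

# xpath() always returns a list, even when only one element matches.
matches = doc.xpath("//div[@class='extract_content']/p")
paragraph_texts = [p.text_content() for p in matches]
print(paragraph_texts)
```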

2. Use basic element names and attribute values.


Here is an example of HTML code that demonstrates the use of simple XPaths that use basic element names and attribute values:

<html>
  <body>
    <div class="content">
      <p>The first paragraph </p>
    </div>
    <div class="content">
      <p>The second paragraph</p>
    </div>
  </body>
</html>

If you want to select both div elements with a class attribute of "content", use the following XPath:

//div[@class='content']

When you apply the Xpath, it selects the following elements from the code above.

<div class="content">
  <p>The first paragraph</p>
</div>
<div class="content">
  <p>The second paragraph</p>
</div>
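When an XPath matches several elements, a scraper usually loops over them and pulls data out of each one. Here is a rough sketch of that pattern with lxml, using the HTML from this section. The second XPath starts with a dot so that it is evaluated relative to each matched div rather than the whole document.

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div class="content"><p>The first paragraph</p></div>
  <div class="content"><p>The second paragraph</p></div>
</body></html>
""")

divs = doc.xpath("//div[@class='content']")
print(len(divs))

# Evaluate a second XPath inside each matched div (note the leading dot).
texts = [div.xpath("./p/text()")[0] for div in divs]
print(texts)
```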

3. Use AND and OR operators


The AND and OR operators can be used in XPath expressions to combine multiple conditions and create more efficient and flexible selectors. The AND operator allows you to select elements that match multiple conditions, while the OR operator allows you to select elements that match at least one of multiple conditions.


Here is an example of HTML code that we can use to demonstrate the use of the "and" and "or" operators to combine multiple conditions in a single XPath:

<html>
  <body>
    <div class="content" id="main">
      <p>Hello World</p>
    </div>
    <div class="sidebar" id="secondary">
      <p>Goodbye World</p>
    </div>
  </body>
</html>

3.1 The use of the AND operator

The following XPath selects the div containing "Hello World".

//div[@class='content' and @id='main']

After applying the Xpath, the resultant HTML element is given below.

<div class="content" id="main">
  <p>Hello World</p>
</div>

An element is selected only if both conditions are true; in this case, only the first div satisfies both.

3.2 The use of OR Operator

The OR operator matches if at least one of the conditions in the XPath is true.

The following XPath expression selects both divs.


//div[@class='content' or @id='secondary']

After applying the Xpath, the resultant HTML element is given below.

<div class="content" id="main">
  <p>Hello World</p>
</div>
<div class="sidebar" id="secondary">
  <p>Goodbye World</p>
</div>

In this case, the class condition is true for the first div, and the id condition is true for the second div. For that reason, both divs are selected.


The AND and OR operators can help you create more efficient and targeted XPath expressions, especially when working with large and complex HTML documents. However, it is vital to use these operators carefully and thoroughly test your XPath expressions to ensure they select the correct elements.
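One way to test these expressions, as suggested above, is to run them with lxml and look at which ids come back. This is a small sketch against the HTML from this section; note that XPath itself spells the operators in lowercase ("and", "or").

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div class="content" id="main"><p>Hello World</p></div>
  <div class="sidebar" id="secondary"><p>Goodbye World</p></div>
</body></html>
""")

both = doc.xpath("//div[@class='content' and @id='main']")
either = doc.xpath("//div[@class='content' or @id='secondary']")

both_ids = [d.get("id") for d in both]
either_ids = [d.get("id") for d in either]
print(both_ids)    # ['main']
print(either_ids)  # ['main', 'secondary']
```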


4. Use the "contains" function.


The "contains" function is a valuable feature of XPath that allows you to select elements based on the presence of a specific string in their attribute values. This can be especially useful when the attribute values are dynamic or when you are not sure of the exact value of an attribute.


Here is an example of HTML code that demonstrates the use of the "contains" function to match element attributes that have a specific string value:

<html>
  <body>
    <div class="content-main">
      <p>Hello World</p>
    </div>
    <div class="sidebar-content">
      <p>Goodbye World</p>
    </div>
  </body>
</html>

We can use the contains function if we need to select both divs. See the Xpath below.

//div[contains(@class, 'content')]

The XPath above selects the following two div elements because their class attributes contain the string "content":

<div class="content-main">
  <p>Hello World</p>
</div>
<div class="sidebar-content">
  <p>Goodbye World</p>
</div>
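The same check can be sketched in Python with lxml. The second query is an extra illustration that is not in the HTML walkthrough above: "contains" is a plain substring test, so it can be applied to text as well as to attributes.

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div class="content-main"><p>Hello World</p></div>
  <div class="sidebar-content"><p>Goodbye World</p></div>
</body></html>
""")

# Substring match on the class attribute.
hits = doc.xpath("//div[contains(@class, 'content')]")
classes = [d.get("class") for d in hits]
print(classes)

# Substring match on the text of a p element.
goodbye = doc.xpath("//p[contains(text(), 'Goodbye')]/text()")
print(goodbye)
```

Because the match is on any substring, "contains" can also pick up classes you did not intend, so it is worth checking the results on a real page.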

5. Use the "text()" function


The "text()" function is a valuable feature in XPath that allows you to match elements based on their inner text rather than their element name or attribute values. This can be useful when selecting elements that contain specific text but may not have a unique element name or attribute value.


Here is an example of HTML code that demonstrates the use of the "text()" function to match elements based on their inner text:

<html>
  <body>
    <div>
      <p>Hello World</p>
    </div>
    <div>
      <p>Goodbye World</p>
    </div>
  </body>
</html>

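Also, see the XPath expression using the text() function below.

//div[p/text()='Hello World']

The XPath above will select the following div element because its p child contains the text "Hello World". The test is written against p/text() because text() only sees an element's own text nodes, so //div[text()='Hello World'] would match nothing here: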

<div>
  <p>Hello World</p>
</div>

It is also worth noting that a comparison such as text()='Hello World' only matches the exact text specified in the XPath expression. If you want to match text that starts with a specific string or contains it anywhere, combine "text()" with the "starts-with" or "contains" functions. The "ends-with" function is only available in XPath 2.0 and later, which browsers and lxml do not support.
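A point that is easy to miss is where the text actually lives: text() only looks at an element's own text nodes, not at the text of its children. The sketch below, using lxml and the HTML from this section, compares three ways of writing the test.

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div><p>Hello World</p></div>
  <div><p>Goodbye World</p></div>
</body></html>
""")

# The text belongs to the <p>, so a test on the <p> matches.
p_hits = doc.xpath("//p[text()='Hello World']")

# To get the enclosing div, test the text of its p child.
div_hits = doc.xpath("//div[p/text()='Hello World']")

# The div has no 'Hello World' text node of its own, so this finds nothing.
direct_hits = doc.xpath("//div[text()='Hello World']")

print(len(p_hits), len(div_hits), len(direct_hits))
```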


6. Use "starts-with" and "ends-with" functions


The "starts-with" and "ends-with" functions are helpful features in XPath that allow you to match element attributes that start or end with a specific string. This can be useful when selecting elements with attribute values that contain a specific keyword or phrase but may not match exactly.


Here is an example of HTML code that demonstrates the use of the "starts-with" and "ends-with" functions to match element attributes that start or end with a specific string:


<html>
  <body>
    <div id="main-content">
      <p>Hello World</p>
    </div>
    <div id="secondary-sidebar">
      <p>Goodbye World</p>
    </div>
  </body>
</html>

For example, imagine you have a list of elements with id attributes that are generated dynamically but always start with a specific string. You could use the "starts-with" function to select these elements based on the beginning of their id attribute.


Similarly, suppose you have a list of elements with class attributes that are generated dynamically but always end with a specific string. In that case, you could use the "ends-with" function to select these elements based on the ending of their class attribute.

Example of using Starts with

//div[starts-with(@id, 'main')]

The XPath above will select the following HTML snippet because its id attribute starts with "main":

<div id="main-content">
  <p>Hello World</p>
</div>

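Example of using ends with

//div[substring(@id, string-length(@id) - string-length('sidebar') + 1) = 'sidebar']

The "ends-with" function exists only in XPath 2.0 and later. Browsers and lxml implement XPath 1.0, where //div[ends-with(@id, 'sidebar')] raises an error, so the expression above uses "substring" and "string-length" to compare the end of the id instead. It will select the following div element because its id attribute ends with "sidebar":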

<div id="secondary-sidebar">
  <p>Goodbye World</p>
</div>

The "starts-with" and "ends-with" functions can be helpful when working with dynamic or variable attribute values that may not always match exactly. However, it is important to use these functions carefully and thoroughly test your XPath expressions to ensure they are selecting the correct elements.


It is also worth noting that the "starts-with" and "ends-with" functions will only match elements with attribute values that start or end with the exact string specified in the XPath expression. If you want to match elements with attribute values that contain a specific string anywhere within the value, you may need to use the "contains" function instead.
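Here is a short sketch of both checks in lxml, which implements XPath 1.0. In XPath 1.0 there is no built-in "ends-with", so the sketch emulates it with "substring" and "string-length". The variable name $s is my own; lxml fills it in from the keyword argument passed to xpath().

```python
from lxml import etree, html

doc = html.fromstring("""
<html><body>
  <div id="main-content"><p>Hello World</p></div>
  <div id="secondary-sidebar"><p>Goodbye World</p></div>
</body></html>
""")

starts = [d.get("id") for d in doc.xpath("//div[starts-with(@id, 'main')]")]

# XPath 1.0 stand-in for ends-with(@id, $s): compare the tail of the id.
ends_expr = "//div[substring(@id, string-length(@id) - string-length($s) + 1) = $s]"
ends = [d.get("id") for d in doc.xpath(ends_expr, s="sidebar")]

# Calling ends-with() directly fails in lxml because the function is not defined.
try:
    doc.xpath("//div[ends-with(@id, 'sidebar')]")
    ends_with_supported = True
except etree.XPathEvalError:
    ends_with_supported = False

print(starts, ends, ends_with_supported)
```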


7. Use the "following" and "preceding" axes


The "following" and "preceding" axes in XPath allow you to select elements that come after or before the current element in document order. The "following" axis skips the current element's descendants, and the "preceding" axis skips its ancestors. This can be useful when trying to select elements related to the current element but not nested within it.


For example, imagine you have a list of elements with a specific class, and you want to select all the elements that come after these elements in the HTML structure. You could use the "following" axis to select these elements based on their position relative to the current element.


Similarly, suppose you have a list of elements with a specific class and want to select all the elements that come before these elements in the HTML structure. In that case, you could use the "preceding" axis to select these elements based on their relative position to the current element.


Here is an example of HTML code that demonstrates the use of the "following" and "preceding" axes to select elements that come after or before the current element in the HTML structure:

<html>
  <body>
    <p>Foo</p>
    <div class="content">
      <p>Hello World</p>
      <p>Goodbye World</p>
    </div>
    <p>Bar</p>
    <p>Baz</p>
  </body>
</html>

7.1 The use of "following" axes


See the Xpath expression below.

//div[@class='content']/following::p

This expression will select the following two p elements because they come after the div element with a class attribute of "content":

<p>Bar</p>
<p>Baz</p>

7.2 The use of "preceding" axes


See the Xpath expression below.

//div[@class='content']/preceding::p

The XPath will select the following p element because it comes before the div element with a class attribute of "content":

<p>Foo</p>

The "following" and "preceding" axes can be helpful when working with HTML documents with a complex structure and you want to select elements based on their position relative to other elements. However, it is important to use these axes carefully and thoroughly test your XPath expressions to ensure they are selecting the correct elements.


It is also worth noting that the "following" and "preceding" axes match at any depth, not only at the same level as the current element. If you want only siblings, meaning elements that share the current element's parent, use the "following-sibling" and "preceding-sibling" axes. If you want to select elements that are ancestors or descendants of the current element, use the "ancestor" or "descendant" axes instead.
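The sketch below runs these axes with lxml on the HTML from this section. It also includes "following-sibling", a separate XPath axis that is limited to elements sharing the same parent, so the two can be compared side by side.

```python
from lxml import html

doc = html.fromstring("""
<html>
  <body>
    <p>Foo</p>
    <div class="content">
      <p>Hello World</p>
      <p>Goodbye World</p>
    </div>
    <p>Bar</p>
    <p>Baz</p>
  </body>
</html>
""")

# Starting from the div: everything after or before it in document order.
after_div = doc.xpath("//div[@class='content']/following::p/text()")
before_div = doc.xpath("//div[@class='content']/preceding::p/text()")

# Starting from the "Hello World" paragraph inside the div.
start = "//p[text()='Hello World']"
following = doc.xpath(start + "/following::p/text()")
siblings = doc.xpath(start + "/following-sibling::p/text()")

print(after_div, before_div)
print(following)  # reaches p elements outside the div too
print(siblings)   # only p elements with the same parent
```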


8. Use the "ancestor," "descendant," and "parent" axes to select elements


The "ancestor," "descendant," and "parent" axes in XPath allow you to select elements that are related to the current element in the HTML hierarchy. This can be useful when the element you want sits above or below the current element in the HTML structure.

For example, imagine you have an element with a specific class and want to select all the elements that are ancestors of this element in the HTML structure. You could use the "ancestor" axis to select these elements based on their relationship to the current element.


Similarly, suppose you have an element with a specific class and want to select all the elements that are descendants of this element in the HTML structure. In that case, you could use the "descendant" axis to select these elements based on their relationship to the current element.


Finally, suppose you have an element with a specific class and want to select its parent element in the HTML structure. You could use the "parent" axis to select this element based on its relationship to the current element.


Here is an example of HTML code that demonstrates the use of the "ancestor," "descendant," and "parent" axes to select elements that are related to the current element in the HTML hierarchy:


<html>
  <body>
    <div>
      <p>Foo</p>
      <div class="content">
        <p>Hello World</p>
        <p>Goodbye World</p>
      </div>
      <p>Bar</p>
    </div>
  </body>
</html>

8.1 Using ancestor axes to select an element


See the Xpath expression below.

//div[@class='content']/ancestor::body

The XPath will select the following body element because it is an ancestor of the div element with a class attribute of "content". See the resultant HTML below.

<body>
  <div>
    <p>Foo</p>
    <div class="content">
      <p>Hello World</p>
      <p>Goodbye World</p>
    </div>
    <p>Bar</p>
  </div>
</body>

8.2 Using the Descendant Axes


See the Xpath Expression below

//div[@class='content']/descendant::p

The XPath "//div[@class='content']/descendant::p" will select the following two p elements because they are descendants of the div element with a class attribute of "content":

<p>Hello World</p>
<p>Goodbye World</p>

8.3 Using the Parent Axes


See the Xpath Expression below

//div[@class='content']/parent::div

The XPath will select the following div element because it is the parent of the div element with a class attribute of "content":

<div>
  <p>Foo</p>
  <div class="content">
    <p>Hello World</p>
    <p>Goodbye World</p>
  </div>
  <p>Bar</p>
</div>

The "ancestor," "descendant," and "parent" axes can be helpful when you are working with HTML documents that have a complex structure, and you want to select elements based on their relationship to other elements. However, it is important to use these axes carefully and thoroughly test your XPath expressions to ensure they are selecting the correct elements.


It is also worth noting that the "ancestor," "descendant," and "parent" axes will only match elements that are higher or lower in the HTML hierarchy than the current element.
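As a quick check of all three axes together, here is a sketch with lxml on the HTML from this section. The ancestor::* query, which asks for every ancestor rather than only body, is an addition for illustration.

```python
from lxml import html

doc = html.fromstring("""
<html>
  <body>
    <div>
      <p>Foo</p>
      <div class="content">
        <p>Hello World</p>
        <p>Goodbye World</p>
      </div>
      <p>Bar</p>
    </div>
  </body>
</html>
""")

start = "//div[@class='content']"

# Every element above the content div, up to the root.
ancestor_tags = [e.tag for e in doc.xpath(start + "/ancestor::*")]

# The p elements nested anywhere below the content div.
descendant_texts = doc.xpath(start + "/descendant::p/text()")

# The single element directly above the content div.
parent = doc.xpath(start + "/parent::div")[0]
parent_own_paragraphs = parent.xpath("./p/text()")

print(ancestor_tags)
print(descendant_texts)
print(parent_own_paragraphs)
```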


Test your XPaths in the browser console or a tool like XPath Helper to ensure they select the correct elements. This will help you catch mistakes and fine-tune your XPaths for maximum accuracy.


How to use relative Xpaths with Python to scrape data.


How to use Xpath with lxml to scrape data


import requests
from lxml import html

# Make a request to the webpage and fail early on an HTTP error
page = requests.get('http://www.example.com', timeout=10)
page.raise_for_status()

# Parse the webpage content
tree = html.fromstring(page.content)

# Use relative XPaths to select specific elements
title = tree.xpath('//title/text()')
paragraphs = tree.xpath('//div/p')

# Print the title and the text of each paragraph
print(title)
print([p.text_content() for p in paragraphs])

This example makes a request to a webpage, parses the content with lxml, and then uses relative Xpaths to select the title element and all p elements that are children of a div element. The results are then printed to the console.


How to use CSS Selectors with BeautifulSoup to scrape data


BeautifulSoup does not support XPath. Its select() and select_one() methods take CSS Selectors instead, so the XPaths from the lxml example have to be rewritten as their CSS equivalents. Here is an example of how to do this:

from bs4 import BeautifulSoup
import requests

# Make a request to the webpage
page = requests.get('http://www.example.com')

# Parse the webpage content
soup = BeautifulSoup(page.content, 'lxml')

# Use CSS Selectors (the equivalents of //title and //div/p)
title = soup.select_one('title')
paragraphs = soup.select('div > p')

# Print the results
print(title.text)
print(paragraphs)

This example works similarly to the previous example but uses BeautifulSoup to parse the webpage content and CSS Selectors, rather than XPaths, to select the elements.


You can see a detailed blog on using lxml and BeautifulSoup here: Scraping IMDB data using Python BeautifulSoup and lxml

What makes a good Xpath Expression


XPath is an essential tool for web scraping, as it allows you to select the exact elements you want without selecting any additional elements. This is especially important when running scrapers periodically on websites with changing data, as the structure of the page can also change or vary across items being scraped. To ensure that your XPath expression works in all scenarios, it is important to make sure that it is both specific and robust to change.


A good XPath expression should be specific enough to capture only what you need, while still being robust enough to handle changes in the page structure. For example, if an image is inserted or a field such as an author's name is absent, your XPath expression should still work correctly. Additionally, it should be able to handle variations in the page structure across different items being scraped. By ensuring that your XPath expression meets these criteria, you can ensure that your web scraper will continue to work even when changes occur on the website.
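To make this concrete, here is a minimal sketch of a scraper loop that tolerates an inserted image and a missing author. Everything in it is hypothetical: the markup, the class names "post" and "author", and the helper first_or_none are made up for illustration.

```python
from lxml import html

# Hypothetical listing page: the second post has an extra image and no author.
page = html.fromstring("""
<html><body>
  <div class="post"><h2>First post</h2><span class="author">Ann</span></div>
  <div class="post"><img src="cover.png"><h2>Second post</h2></div>
</body></html>
""")

def first_or_none(node, xpath):
    """Return the first text result of a relative XPath, or None if nothing matches."""
    hits = node.xpath(xpath)
    return hits[0].strip() if hits else None

rows = []
for post in page.xpath("//div[@class='post']"):
    rows.append({
        # Matching on tag and class, not position, survives the inserted image.
        "title": first_or_none(post, ".//h2/text()"),
        "author": first_or_none(post, ".//span[@class='author']/text()"),
    })

print(rows)
```

The design choice here is to select each field by what it is (its tag or class) instead of where it is (for example, the second child), and to treat an empty result as a missing value, not an error.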


Conclusion


Xpath Queries are a powerful tool for web scraping that allows you to easily navigate and select specific elements from a webpage. While learning the basic syntax and functions of Xpaths is important, there are also a few best practices and tips to keep in mind when using them for web scraping.


First, it's always a good idea to test your XPath queries on a small sample of the webpage before running them on the entire site, to ensure that you are correctly selecting the elements you want. It's also a good idea to use relative XPaths whenever possible, as these are less likely to break if the structure of the webpage changes.


With these best practices in mind, you'll be well on your way to effectively using XPaths to scrape the web and gather the data you need for your projects. Try reading web scraper code written by experienced developers who use XPath expressions, and work out what each expression means.

