How to Use Web Scraping to Track GDPR Fines and Enforcement Cases
- Anusha P O
- Oct 22
- 23 min read
Updated: 2 days ago
In today’s digital world, organizations handle massive amounts of personal data, and protecting that information has become a serious responsibility. Data protection is no longer just a legal phrase tucked away in regulations; it is a practical necessity. With the introduction of the General Data Protection Regulation (GDPR) in 2018, companies across Europe and beyond have been held to higher standards in handling personal data. When these standards are not met, the consequences often come in the form of fines and penalties.
Tracking these enforcement actions is made easier by the GDPR Enforcement Tracker, an online database maintained by CMS (CMS.Law). It collects and organizes details of fines issued by data protection authorities across the European Union (EU) and the European Economic Area (EEA). Instead of just learning GDPR theory, the tracker shows how the law is applied in real situations—highlighting real cases, real penalties, and real lessons.
Each case tells a story: which country issued the fine, the type of violation, the amount imposed, and the GDPR article involved. Fines range from smaller penalties against individuals or local organizations to multimillion-euro actions against global tech companies. This makes the tracker a valuable resource for understanding enforcement patterns, seeing where regulators are most active, and identifying common compliance mistakes.
While not every fine is made public, the database provides a structured and regularly updated overview. Its columns—ETid, Country, Date of Decision, Fine, Controller or Processor, Quoted Article, Type, and Source—turn individual cases into a dataset that can reveal broader insights.
In this blog, we will explore how to turn this rich online resource into a usable dataset. From scraping the data, handling missing values, to cleaning and structuring it for analysis, this step-by-step guide will demonstrate how even a large, multi-page table of enforcement cases can be transformed into actionable information. By the end, readers will understand not only the mechanics of data extraction but also the insights this dataset can reveal about GDPR enforcement across Europe.
A Smarter Way to Collect Data
When dealing with large websites filled with structured information, manually copying data is simply not practical. Imagine scrolling through hundreds of enforcement cases, each listing details like the country, the decision date, the fine amount, and the GDPR article cited. Doing this by hand would take hours and still risk mistakes. This is where web scraping offers a far smarter and more reliable way forward.
The GDPR Enforcement Tracker website is a good example. It publishes detailed records of data protection enforcement across Europe, but the information sits in long tables spread across multiple pages. Instead of treating it like just another static webpage, a scraper can transform it into a structured dataset. Every row of the table—containing the case ID, the organization involved, the penalty imposed, and even the link to the official source—gets extracted automatically.
Step 1: Collecting Data from a Paginated Table
The first step in this scraping project is figuring out where the data lives and how it is structured. In this case, the GDPR Enforcement Tracker website presents information in a tabular format. Each row of the table represents an enforcement case, containing details such as the case ID, the country, the date of the decision, the fine amount, the type of violation, and even a link to the original source.
Now, the table isn’t just a single page—it actually holds thousands of entries (2,839 in this case). To make things manageable, the site uses pagination: only a portion of the data is visible at once, and a “Next” button at the bottom of the page lets you move forward. While this makes browsing easier for humans, it adds an extra layer of complexity for a scraper. The scraper not only has to read the data from the current page but also click “Next” repeatedly until all entries are collected.
Think of it like flipping through a big book with multiple chapters. You don’t want to read just the first chapter—you want the entire story. In the same way, the scraper patiently goes through every page of the table, ensuring no case is left behind. Once this step is complete, you have a comprehensive list of all enforcement cases ready to be stored and analyzed.
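Conceptually, the pagination loop boils down to the short sketch below. It is a condensed preview rather than the final scraper (which is built step by step later in this post), but the selectors shown here are the same ones the full version uses.
# Minimal pagination preview (condensed; the complete, logged version appears later)
from playwright.sync_api import sync_playwright
import time

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.enforcementtracker.com/")
    page.wait_for_selector("table#penalties tbody tr")
    while True:
        rows = page.query_selector_all("table#penalties tbody tr")
        print(f"Rows on this page: {len(rows)}")
        # Stop when the "Next" control is missing or disabled
        next_button = page.query_selector("a#penalties_next")
        if next_button is None or "disabled" in (next_button.get_attribute("class") or ""):
            break
        next_button.click()
        page.wait_for_selector("table#penalties tbody tr")
        time.sleep(2)
    browser.close()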
Step 2: Cleaning the Data
Once all the rows have been scraped and stored, the next challenge is ensuring the data is clean and reliable. Raw data from websites often looks neat at first glance, but when you dig deeper, you’ll notice small issues—missing values, inconsistent formatting, or incomplete entries. These problems may seem minor, but they can make analysis difficult later on.
In the GDPR Enforcement Tracker dataset, most of the values were well-structured because the source itself is a clean table. However, a few columns had gaps. For example, some cases did not list the exact fine amount, while others had the quoted GDPR article marked as “Unknown”. These missing or unclear values needed attention before moving ahead, so they were filled with a consistent “N/A” placeholder.
For this task, OpenRefine proved handy. It’s a beginner-friendly tool designed for cleaning and organizing messy data. Think of it like a workshop table where you can polish and fix raw pieces before building something useful. With OpenRefine, it was easy to standardize entries and handle missing values.
Of course, when dealing with much larger files or more complex cleaning tasks, Python libraries like pandas are often better suited. Pandas lets you programmatically check for gaps, replace missing values, and even reformat entire columns with just a few lines of code. But for the current dataset, OpenRefine did the job effectively.
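For readers who prefer the programmatic route, a minimal pandas sketch of the same clean-up might look like the following. The CSV file name is hypothetical; it stands in for whatever export of the scraped table you are working with.
# Hypothetical clean-up sketch with pandas (file names are placeholders)
import pandas as pd

df = pd.read_csv("gdpr_cases.csv")

# Treat empty strings and "Unknown" markers as missing values
df = df.replace({"": pd.NA, "Unknown": pd.NA})

# Fill every remaining gap with a consistent placeholder
df = df.fillna("N/A")

df.to_csv("gdpr_cases_clean.csv", index=False)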
In short, this step ensures the dataset is trustworthy. Clean data means fewer headaches later, whether the goal is statistical analysis, visualization, or sharing insights with others.
Essential Tools for Smooth and Efficient Data Extraction
When it comes to collecting structured data from websites, the right combination of tools can make the process smooth, reliable, and surprisingly fast. Think of it as building a small team where each member has a specific role: some handle browsing, others extract information, and a few organize and store the results.
In this setup, Playwright takes the lead. It’s a Python library that can open web pages, navigate through links, click buttons, and even scroll tables—just like a human user would. By automating these actions, Playwright removes the need to manually copy data, which can be tedious and error-prone, especially when the website has thousands of rows or multiple pages.
To store the data efficiently, sqlite3 is used. It’s a lightweight database that organizes information neatly without the need for a separate server. Each scraped row—like case ID, country, fine amount, and source link—can be saved safely and retrieved whenever needed. Meanwhile, logging keeps track of every action and any errors that occur, providing a clear trail of what the scraper did and helping debug issues if something goes wrong.
Other supporting tools also play key roles. The time module introduces small pauses to ensure web pages load properly, re helps clean or extract specific patterns from text, and urljoin ensures all links are converted into absolute URLs, so nothing gets lost during navigation.
Altogether, this combination forms a resilient and organized toolkit for web scraping. Each library contributes its strength—automation, data storage, logging, or text processing—allowing even large-scale extraction from dynamic, paginated websites to be handled efficiently and accurately. Using these tools together transforms a static webpage into a structured, analyzable dataset ready for deeper insights.
Importing the Right Tools
Before we dive into any kind of scraping or automation, we first need to set up the tools that will help us do the job. Think of this step as gathering all the ingredients before you start cooking. If something is missing, you’ll get stuck halfway through the recipe.
In Python, these “ingredients” come in the form of libraries. Each library brings its own special abilities to the table, and together, they make our project possible. Let’s walk through the ones we’re using here and why they matter.
# Importing libraries
from playwright.sync_api import sync_playwright
import sqlite3
import logging
import time
import re
from urllib.parse import urljoin
"""Dependencies:
- Playwright: For browser automation.
- sqlite3: For local database management.
- logging: For logging actions and errors.
- time & urllib.parse: For delay handling and URL joining.
- re: Regular expressions (optional usage)."""
The first one, Playwright, is the star of our setup. It allows us to control a browser automatically, almost like a robot clicking buttons and navigating pages for us. With Playwright, we can open websites, scroll through them, and capture information—without having to do everything manually.
Next comes sqlite3, which is a built-in library in Python. Think of it as a small notebook where we can neatly store all the data we collect. Instead of having everything scattered in memory, we’ll keep it safe in a local database file that we can query later.
The logging library is like our personal diary for this project. It keeps track of what happens while the code runs—whether things are going smoothly or if any errors pop up. This makes troubleshooting much easier, especially when the project grows bigger. Then we have time, which we’ll use to add delays. Sometimes, when scraping, rushing through pages too quickly can raise suspicion or even block us from accessing the site. Adding small pauses makes our automation look more natural.
The re library handles regular expressions. You can think of it as a magnifying glass we use to spot patterns in text. For example, if we want to pick out numbers or clean up messy strings, regex helps us do it. Lastly, urljoin from urllib.parse comes into play when we need to handle web links. Websites often provide relative URLs, and this tool helps us turn them into complete, usable links.
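As a small illustration of where re earns its keep, the sketch below pulls a numeric value out of a fine string. The example string is invented; the scraper in this post stores fines exactly as displayed, so a step like this would only happen later, during analysis.
# Hypothetical example: extracting a number from a fine string with re
import re

fine_text = "EUR 20,000"  # made-up value for illustration
digits = re.sub(r"[^\d]", "", fine_text)  # keep only the digits
fine_value = int(digits) if digits else None
print(fine_value)  # 20000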
By combining all these libraries, we get a powerful yet lightweight toolkit: Playwright for automation, sqlite3 for storage, logging for tracking, time for delays, re for pattern matching, and urljoin for building links. With these in place, we’re ready to move on to the next stage of our project.
Setting Up Logging
Once we have our tools imported, the next important step is to make sure we keep track of what our program is doing. Imagine running a long scraping script overnight and waking up in the morning only to see it crashed halfway through. Without a record of what happened, it would feel like trying to solve a mystery with no clues. This is exactly why logging is so useful.
In Python, logging acts like a journal for your program. Every time something important happens—whether it’s a success, a warning, or an error—it writes an entry into a log file. Later, when you want to understand how your scraper behaved, you can just read through that file.
# Logging Setup
logging.basicConfig(
    filename="gdpr_scraper.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
"""
Logging Setup Configuration
This section configures the logging module to record events and errors during the scraping process.
- Logs will be saved to the file named "gdpr_scraper.log".
- Logging level is set to INFO, meaning that INFO, WARNING, ERROR, and CRITICAL messages will be recorded.
- Log messages will include:
- Timestamp of the log entry (when the event occurred).
- Logging level (INFO, ERROR, etc.).
- The actual message describing the event or error.
Purpose:
The log file helps in tracking the progress of the scraper and diagnosing any issues that occur during execution.
For example, each time a row is successfully saved or an error happens during scraping, it will be logged."""
With this configuration, all log messages will be saved into a file called gdpr_scraper.log. This file acts like a black box recorder for your scraper—it captures everything you tell it to.
We’ve set the logging level to INFO, which means the file will record not just important errors, but also general progress updates. For example, if the scraper successfully saves a new case to the database, we’ll log that. If it runs into an error, that too will be logged.
The format parameter makes sure each log entry contains three pieces of information:
When it happened (the timestamp).
What type of event it was (INFO, WARNING, ERROR, etc.).
The actual message describing what occurred.
Together, these details make debugging much easier. Instead of staring at your code wondering why it broke, you can open the log file and retrace exactly what happened step by step.
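If you would also like to see those messages scroll by in the terminal while the file is being written, one optional tweak (not part of the original script) is to attach a console handler alongside the file handler:
# Optional: mirror log messages to the console as well as the log file
console = logging.StreamHandler()
console.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
logging.getLogger().addHandler(console)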
So in short, while logging might look like a small setup, it’s actually one of the most important parts of any scraping project. It helps us stay in control, even when the script is running on its own.
Defining Constants and Preparing the Database
Now that we’ve set up our logging system, the next step is to decide where our scraper will look and where it will store the information it collects. Think of this as setting the destination before starting a journey, and also keeping a notebook ready to record everything you find along the way.
# Constants & DB Setup
URL = "https://www.enforcementtracker.com/"
DB_NAME = "gdpr_enforcement_tracker1.db"
TABLE_NAME = "gdpr_cases"
"""
Constants used for scraping and database setup:
- URL: Source website for GDPR enforcement tracker data.
- DB_NAME: Name of the SQLite database file.
- TABLE_NAME: Name of the table where case data will be stored.
"""In this part, we set three constants to guide our scraper.
URL tells the scraper where to look for data—the GDPR Enforcement Tracker website.
DB_NAME is the name of the local SQLite file where we’ll save everything, like a notebook for our data.
TABLE_NAME is the section inside that notebook where the cases will be stored.
Defining them at the top makes the code cleaner and easier to change later if needed.
Creating the Database Table
Once we know where to store our data, the next step is to prepare the actual space inside the database. Think of it like setting up an empty table in a notebook before you start writing in it. Without that table, you wouldn’t know where each piece of information should go.
# Database Functions
def create_table():
    """Create the SQLite table to store GDPR enforcement cases if it doesn't exist.
    1. Connects to the SQLite database specified by DB_NAME.
    2. Executes a SQL command to create a table named TABLE_NAME with specified columns if it doesn't already exist.
    3. Commits the changes and closes the connection.
    Table Columns:
    - ETid: Unique identifier for the enforcement decision.
    - Country: Country where the enforcement was issued.
    - DateOfDecision: Date when the decision was made.
    - Fine: The fine imposed in the case.
    - ControllerProcessor: Name of the controller or processor involved.
    - QuotedArt: Quoted GDPR Article relevant to the decision.
    - Type: Type of enforcement decision.
    - Source: URL of the source document or reference.
    """
    conn = sqlite3.connect(DB_NAME)
    conn.execute(f"""
        CREATE TABLE IF NOT EXISTS {TABLE_NAME} (
            ETid TEXT,
            Country TEXT,
            DateOfDecision TEXT,
            Fine TEXT,
            ControllerProcessor TEXT,
            QuotedArt TEXT,
            Type TEXT,
            Source TEXT
        )
    """)
    conn.commit()
    conn.close()
The function create_table() takes care of this preparation. Here’s what it does step by step:
It connects to the database file we defined earlier (gdpr_enforcement_tracker1.db). If the file doesn’t exist yet, SQLite will quietly create it for us.
It runs a command that says: “Make me a table called gdpr_cases with the following columns—ETid, Country, DateOfDecision, Fine, ControllerProcessor, QuotedArt, Type, and Source.”
Each column has a clear purpose. For example, ETid is like a unique ID for the case, Fine records the penalty, and Source points to the official document or link.
Finally, it saves the changes and closes the connection to the database.
By doing this, we’ve set up a neat structure that ensures all the scraped data will land in the right place. If we skip this step, the scraper would have nowhere to save its findings, much like trying to write notes without having a page ready.
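A quick way to confirm the table really exists is to query SQLite’s built-in catalog. This is just a sanity check you can run once, not part of the scraper itself.
# Sanity check: list the tables in the database file
import sqlite3

conn = sqlite3.connect("gdpr_enforcement_tracker1.db")
tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(tables)  # expect [('gdpr_cases',)] once create_table() has run
conn.close()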
Saving Scraped Data into the Database
Once our scraper starts collecting information, the next challenge is deciding how to store it safely. Imagine going to a library, taking notes on dozens of books, but then forgetting to file those notes properly—you’d end up with a mess. That’s why we need a dedicated function to neatly save each piece of data into our database.
# Scraping Functions
def save_row_to_sqlite(row):
    """
    Save a single scraped row into the SQLite database.
    Args:
        row (list): List containing values for the enforcement decision:
            [ETid, Country, DateOfDecision, Fine, ControllerProcessor, QuotedArt, Type, Source]
    Logs the saved row ETid for traceability.
    """
    conn = sqlite3.connect(DB_NAME)
    conn.execute(f"""
        INSERT INTO {TABLE_NAME}
        (ETid, Country, DateOfDecision, Fine, ControllerProcessor, QuotedArt, Type, Source)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    """, row)
    conn.commit()
    conn.close()
    logging.info(f"Saved row: {row[0]} | source: {row[7]}")
The function save_row_to_sqlite() does exactly this. Each row of scraped data, which contains details like the case ID, country, fine amount, and source link, is passed into this function. The function then opens a connection to our SQLite database, inserts the row into the correct table, and closes the connection once the job is done.
What makes this step especially useful is the logging. Every time a row is successfully saved, the scraper records the event in the log file with the case ID and source link. Think of this like a “receipt” for each entry—it helps us trace what has already been stored and makes it easier to debug if something goes wrong.
By separating the saving process into its own function, we also make the code cleaner and more reusable. Anytime we want to save new rows, we don’t have to rewrite the database commands—we just call this function. It’s like setting up a reusable storage box where every new record automatically finds its place.
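To make the expected input concrete, here is how a single row would be passed in. Every value below is fabricated purely to show the column order; real rows come from the scraper itself.
# Illustrative call with made-up values, in the column order the table expects
sample_row = [
    "ETid-0000",                          # ETid (hypothetical)
    "SPAIN",                              # Country
    "2023-05-12",                         # DateOfDecision
    "20,000",                             # Fine, as displayed on the site
    "Example Controller Ltd.",            # ControllerProcessor (hypothetical)
    "Art. 32 GDPR",                       # QuotedArt
    "Insufficient technical and organisational measures",  # Type
    "https://example.org/decision.pdf",   # Source (hypothetical)
]
save_row_to_sqlite(sample_row)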
Extracting the Source Link from Each Case
When we scrape the GDPR Enforcement Tracker website, each row in the table represents a case. Along with details like the fine amount or the country, there’s often a source link—a URL pointing to the official decision or a related document. Capturing this link is important, because it allows us to trace back to the original source for verification or deeper reading.
# Scraper Logic
def extract_source_from_tr(tr, etid_href=None):
    """
    Extract the source URL from a table row element representing a GDPR enforcement case.
    This function attempts to find a relevant hyperlink (source document or reference)
    associated with the enforcement case in a table row (`tr`) by following these steps:
    1. First, it looks for an anchor (<a>) tag with the class 'blau', which is typically
       used for the primary source link on the webpage.
       - If found, it processes the href attribute:
         - Converts protocol-relative URLs (starting with "//") to absolute by adding "https:".
         - Converts relative URLs (starting with "/") into absolute URLs using the base URL.
         - Returns the absolute URL directly if it’s already complete.
    2. If no 'blau' class anchor is found, the function searches through all anchor (<a>) tags in the row.
       - It skips the ETid link (provided as `etid_href`) to avoid using it as the source.
       - It returns the first link that:
         - Starts with "http" (absolute URL), or
         - Ends with ".pdf" (document file link).
    3. If no suitable link is found, it returns an empty string.
    Args:
        tr (ElementHandle): A Playwright ElementHandle representing one table row (<tr>) of the GDPR enforcement cases table.
        etid_href (str, optional): The ETid hyperlink (href) to skip, typically the link to the case itself.
    Returns:
        str: The absolute URL of the source document or reference if found, otherwise an empty string.
    Notes:
        - Uses urljoin to handle relative URLs.
        - Logs any unexpected errors during extraction but continues execution.
    """
    try:
        # Prefer anchor with class 'blau'
        a = tr.query_selector("a.blau")
        if a:
            href = a.get_attribute("href") or ""
            href = href.strip()
            if href:
                if href.startswith("//"):
                    return "https:" + href
                if href.startswith("/"):
                    return urljoin(URL, href)
                return href
        # Fallback: first anchor that isn’t the ETid link
        anchors = tr.query_selector_all("a")
        for link in anchors:
            href = (link.get_attribute("href") or "").strip()
            if not href or href == etid_href:
                continue
            if href.startswith("http") or href.endswith(".pdf"):
                return href
    except Exception as e:
        logging.error(f"extract_source_from_tr error: {e}")
    return ""
That’s where the function extract_source_from_tr() comes in. Its job is simple but essential: look at a single table row on the webpage and try to find the right link to save. Here’s how it works. First, the function checks if the row has a special link marked with the CSS class "blau". On this website, that usually means the main source link. If such a link exists, the function carefully examines its format. Some links may be written in shorthand (like starting with // or /), so the function fixes them by adding the base website address to make them full, usable URLs. If the link is already complete, it just returns it as is.
But what if the "blau" link isn’t there? In that case, the function doesn’t give up. It goes through all the other links in the row, skipping over the case ID link (since that just points back to the case itself). From the remaining links, it picks the first one that looks valid—either a regular website link (starting with http) or a document link (ending with .pdf). And if nothing useful is found, the function simply returns an empty string. This way, the scraper doesn’t break; it just moves on, while also recording the error in the log file for later review.
In plain terms, this function acts like a careful librarian. It goes through the details of each case, finds the most trustworthy source link, cleans it up if needed, and hands it over to be saved. Without this step, our scraper would have the basic case details but no direct trail to the official documents—which would be like having book summaries without being able to check the actual books.
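To see what the URL clean-up actually does, here is a short illustration using urljoin and the tracker’s base address. The paths are invented for the example.
# Illustration: turning shorthand links into full URLs (paths are made up)
from urllib.parse import urljoin

BASE = "https://www.enforcementtracker.com/"

print(urljoin(BASE, "/downloads/decision.pdf"))
# -> https://www.enforcementtracker.com/downloads/decision.pdf

print("https:" + "//dataprotection.example/ruling.pdf")
# -> https://dataprotection.example/ruling.pdf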
Scraping GDPR Enforcement Cases from a Webpage
Now that we have our database ready, let’s move on to the main part of the project: actually scraping the data from the Enforcement Tracker website. This is where our function scrape_page(page) comes in. Think of this function as a worker that looks at one page of the site, reads through the table of cases, and carefully copies each row into our database.
# Main Scraper Logic
def scrape_page(page):
    """Scrape table data from the current page, row by row, saving each row to DB.
    Scrape all GDPR enforcement cases from the current page of the enforcement tracker table.
    This function performs the following actions:
    1. Selects all rows from the penalties table on the current page.
    2. For each row:
       - Extracts the following information:
         - ETid: The unique case identifier.
         - Country: Country where the enforcement was issued (from an image 'alt' attribute).
         - DateOfDecision: Date the decision was made.
         - Fine: The fine amount imposed (as displayed).
         - ControllerProcessor: Name of the controller or processor involved.
         - QuotedArt: The GDPR article cited in the decision.
         - Type: Type of the enforcement case.
         - Source: Link to the source document or relevant reference.
       - Handles missing data gracefully to avoid breaking the process.
       - Saves the extracted row data into the SQLite database.
    3. Logs the number of rows found and any errors encountered while processing rows.
    Notes:
    - Uses the helper function `extract_source_from_tr(tr, etid_href)` to extract the source URL.
    - Each row is saved to the database using `save_row_to_sqlite(row)`.
    - Any parsing or saving error for individual rows is logged, but the process continues for remaining rows.
    """
    table_rows = page.query_selector_all("table#penalties tbody tr")
    logging.info(f"Found {len(table_rows)} rows on current page")
    for idx, tr in enumerate(table_rows, start=1):
        try:
            tds = tr.query_selector_all("td")
            # Map columns according to the table's actual HTML structure
            etid_el = tds[1].query_selector("a") if len(tds) > 1 else None
            etid = etid_el.inner_text().strip() if etid_el else ""
            etid_href = (etid_el.get_attribute("href") or "").strip() if etid_el else ""
            country = ""
            if len(tds) > 2:
                country_img = tds[2].query_selector("img")
                if country_img:
                    country = (country_img.get_attribute("alt") or "").strip()
            date_decision = tds[4].inner_text().strip() if len(tds) > 4 else ""
            fine = tds[5].inner_text().strip() if len(tds) > 5 else ""
            controller = tds[6].inner_text().strip() if len(tds) > 6 else ""
            quoted_art = tds[8].inner_text().strip() if len(tds) > 8 else ""
            case_type = tds[9].inner_text().strip() if len(tds) > 9 else ""
            source = extract_source_from_tr(tr, etid_href=etid_href)
            row = [etid, country, date_decision, fine, controller, quoted_art, case_type, source]
            save_row_to_sqlite(row)
        except Exception as e:
            logging.error(f"Error parsing/saving row {idx}: {e}")
The Enforcement Tracker website organizes GDPR penalty cases in a big table. Each row of this table is like a record card, holding information such as the case ID, the country where it happened, the fine amount, the law article quoted, and a source link. What our function does is simple: go row by row, collect all this information, and store it safely.
When the function runs, the first thing it does is find all the rows inside the penalties table. Imagine you have a stack of papers, and you ask, “How many papers are in this stack?” That’s what the logging.info line does—it tells us how many rows (or “papers”) were found on the current page. This is useful because it helps us track whether the website structure has changed or whether the scraper is working as expected.
Next, the function goes through each row one at a time. For every row, it looks for specific pieces of information. For example:
The ETid, which is like a case number.
The country, which is taken from the little flag image shown in the table.
The date of decision, so we know when the ruling was made.
The fine amount, which is the headline figure most people care about.
The controller or processor, which is just the company or entity that got penalized.
The quoted GDPR article, which tells us which law they broke.
The type of case, describing what kind of violation it was.
And finally, the source link, which points to the original legal document or article.
One nice thing about this function is that it has been written carefully to handle missing data. For example, sometimes a field may not exist in a row. Instead of crashing, the scraper just saves it as an empty string and keeps moving. This makes the process robust, like a person who doesn’t stop writing notes just because one page in a book has a smudge. After gathering all the details, the function saves the row into the database using another helper function called save_row_to_sqlite(row). That way, every case we scrape is permanently stored and can be analyzed later without needing to scrape again.
Of course, web scraping can sometimes be unpredictable—maybe a row has unusual formatting, or the page takes too long to load. That’s why the function also has error handling. If something goes wrong while scraping a particular row, it doesn’t stop the whole process. Instead, it logs the error and simply moves on to the next row. This ensures that one bad record doesn’t ruin the entire run.
In short, scrape_page(page) is like a careful note-taker: it goes through each enforcement case, copies down all the important details, and files them neatly into our database. This is the heart of the scraper, because without it, we wouldn’t have any data to analyze later.
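After a run (or even mid-run, from a separate terminal), a tiny query against the database gives a quick sense of progress. This snippet is a convenience check, not part of the scraper:
# Quick progress check against the SQLite database
import sqlite3

conn = sqlite3.connect("gdpr_enforcement_tracker1.db")
count = conn.execute("SELECT COUNT(*) FROM gdpr_cases").fetchone()[0]
print(f"Cases saved so far: {count}")
conn.close()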
Bringing Everything Together: The Main Function
Up until now, we have looked at how to scrape one page at a time and how to store each case in the database. But a scraper is only truly useful when it can run smoothly from start to finish—moving across multiple pages, collecting all the cases, and finally wrapping up neatly. That’s exactly what our main() function is designed to do. You can think of it as the “conductor of the orchestra,” making sure every part of the scraper plays in harmony.
# Main Execution
def main():
    """
    Main function to run the GDPR Enforcement Tracker web scraper.
    This function performs the following steps:
    1. Initializes logging to track the progress and any errors during scraping.
    2. Creates the SQLite database table if it doesn't already exist, to store scraped data.
    3. Launches a browser using Playwright in non-headless mode
       (so you can see the browser actions for debugging or monitoring).
    4. Opens a new browser page and navigates to the GDPR Enforcement Tracker website.
    5. Waits for the penalties table to load before starting the scraping process.
    6. Repeatedly scrapes all rows of the current penalties table and saves them into the database.
    7. Checks for the "Next" button to move to the next page of results:
       - If the "Next" button is disabled or not found, the scraper stops.
       - Otherwise, it clicks "Next" and continues scraping the next page.
    8. After reaching the last page, it closes the browser session.
    9. Logs that the scraping process finished successfully.
    Purpose:
    This function automates data collection from a paginated GDPR enforcement table,
    storing each case into a local SQLite database for analysis.
    Notes:
    - `headless=False` is used for visibility during development/debugging.
    - A small time delay is added after loading each page to ensure the table is fully loaded.
    - Errors during scraping are logged but do not stop the overall process.
    """
    logging.info("Starting GDPR Enforcement Tracker scraper")
    create_table()
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(URL)
        page.wait_for_selector("table#penalties tbody tr")
        time.sleep(2)
        while True:
            scrape_page(page)
            # Check if "Next" button is disabled
            next_button = page.query_selector("a#penalties_next")
            if next_button is None or "disabled" in (next_button.get_attribute("class") or ""):
                break
            else:
                next_button.click()
                page.wait_for_selector("table#penalties tbody tr")
                time.sleep(2)
        browser.close()
    logging.info("Scraper finished successfully.")
The first thing the function does is set up logging. Logging is like keeping a diary of what the program is doing. Every important step—like starting the scraper, saving rows, or finishing the run—is written down. If something goes wrong, these notes help us trace where the problem happened.
Next, the function calls create_table(). This ensures that the SQLite database is ready to receive data. If the table already exists, the function won’t overwrite it—it just makes sure everything is in place. Imagine preparing a filing cabinet before you start sorting papers into it.
Once the database is ready, the scraper launches a browser using Playwright. Here, the browser is started in non-headless mode. That means you can actually see the browser window opening, loading pages, and clicking through results. This is especially helpful for beginners, because you can watch the scraper in action and confirm that it’s working as expected. Later, once you’re confident, you could switch to headless mode for faster and quieter runs.
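For reference, switching later is only a matter of that one launch flag. A minimal, self-contained headless run might look like this; the title print is there purely to confirm the page loaded.
# Minimal headless sketch (the title print is just a smoke test)
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.enforcementtracker.com/")
    print(page.title())
    browser.close()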
The browser then navigates to the GDPR Enforcement Tracker website, where all the penalty cases are listed. Before doing anything else, the function waits until the table of penalties has fully loaded. This pause is important—without it, the scraper might try to grab data before the page is ready, which would cause errors. A small delay is also added to give the page extra time to settle.
Now comes the main loop. The scraper repeatedly calls the scrape_page(page) function, which extracts the rows of data from the current page and saves them into the database. Once it finishes a page, it looks for the “Next” button. If the button is disabled or missing, that means we’ve reached the last page and the scraper can stop. Otherwise, it clicks the button, waits for the next page to load, and continues scraping. This cycle repeats until every single page of cases has been processed.
Finally, when the last page is done, the browser closes and the scraper logs a message to say that everything finished successfully. At this point, all the scraped cases are neatly stored in the database, ready for analysis.
In simple terms, the main() function is like a project manager: it sets up the workspace, launches the browser, makes sure each page is scraped in order, and closes everything down once the job is complete. Without it, our scraper would just be a collection of disconnected parts. With it, the whole process runs from start to finish, hands-free.
The Entry Point of the Script
Every program needs a clear starting point, a place where the instructions begin. In our scraper, that role is played by the following block of code:
# Entry Point
if name == "__main__":
main()
"""
Entry Point of the Script
This block ensures that the script’s main function runs only when the script
is executed directly, and not when it is imported as a module in another script.
- `if name == "__main__":`
This condition checks if the script is being run as the main program.
- If true, it calls the `main()` function to start the scraper.
- If the script is imported elsewhere, this block is skipped,
so the `main()` function does not run automatically.
"""At first glance, this might look a little mysterious. But here’s what it really means. When Python runs a file, it sets a special variable called name. If the file is being run directly (like typing python scraper.py in the terminal), this variable is set to "__main__". That’s Python’s way of saying, “This is the main script, go ahead and start from here.” So when the condition if name == "__main__": is true, the program calls the main() function, which in our case starts the entire scraping process.
Why is this helpful? Because sometimes you may want to reuse parts of your script in another project. If you import this file into another Python script, the main() function will not run automatically. Instead, only the specific functions you call will execute. Think of it as a safety switch—it prevents the whole scraper from starting up unexpectedly when you only need one small piece of it.
In simple terms, this block tells Python: “Only run the scraper if this file is opened directly. Otherwise, stay quiet.” It’s a neat little way to keep your code flexible and well-behaved.
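With the full script in place and a completed run behind us, a few lines of pandas are enough to start exploring the results. This is a hedged sketch: pandas is assumed to be installed, and the database and table names match the constants defined earlier.
# Read the finished database back and count cases per country
import sqlite3
import pandas as pd

conn = sqlite3.connect("gdpr_enforcement_tracker1.db")
df = pd.read_sql("SELECT * FROM gdpr_cases", conn)
conn.close()

print(df["Country"].value_counts().head(10))  # which countries appear most often?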
Conclusion
Exploring the GDPR Enforcement Tracker through scraping reveals how structured data can turn a static webpage into a rich source of insights. By systematically collecting and cleaning each row of enforcement cases, it becomes possible to see which countries are most active in issuing fines, which types of violations occur most frequently, and how significant the penalties can be—from minor fines to multi-million-euro actions against major organizations. The dataset also highlights the most commonly cited GDPR articles and the variety of entities affected, giving a clear picture of real-world compliance challenges. This approach demonstrates that with careful data gathering and processing, complex regulatory information can be transformed into a format that is easier to analyze, understand, and learn from.
Libraries and Versions
Name: playwright, Version: 1.48.0
Name: sqlite3 (built-in, ships with Python, no separate versioning)
Name: urllib.parse (built-in, standard Python library)
Name: plotly (only needed if you add visualization later), Version: 5.24.1
AUTHOR
I’m Anusha P O, Data Science Intern at Datahut. I specialize in building automated data collection pipelines that transform raw online information into structured, analysis-ready datasets.
In this blog, I dive into a crucial topic in today’s digital era — data protection and GDPR enforcement. Organizations across the globe are responsible for handling vast amounts of personal data, and with the General Data Protection Regulation (GDPR) setting strict standards since 2018, non-compliance can lead to serious fines and legal consequences.
To bring this topic to life, I’ll walk you through how we can use the GDPR Enforcement Tracker to collect, clean, and structure real-world enforcement case data. This includes details like fines, countries, violation types, and quoted articles—turning a complex multi-page table into meaningful insights.
At Datahut, we build smart, scalable scraping solutions that power business decisions with reliable data. If your organization wants to leverage public web data for research, compliance tracking, or competitive intelligence, feel free to connect with us through the chat widget. Let’s transform raw data into actionable intelligence.
FAQs
1. Can web scraping be used to track GDPR fines?
Yes. Web scraping can collect publicly available information from regulatory websites, news portals, and official announcements to track GDPR fines issued to companies.
2. Is it legal to scrape GDPR fine data?
Yes, as long as the data is publicly available and no personal or sensitive information is collected unlawfully. Always ensure compliance with website terms of service and applicable data protection laws.
3. How can tracking GDPR fines benefit businesses?
Tracking GDPR fines helps businesses understand common compliance pitfalls, benchmark their own practices, and implement proactive measures to avoid violations and financial penalties.
4. What tools are recommended for scraping GDPR fine data?
Popular tools include Python libraries like Beautiful Soup, Scrapy, Requests, and automated browsers like Playwright or Selenium. These tools help extract structured information efficiently from websites and reports.
5. How can I ensure accuracy and reliability while scraping GDPR fines?
Cross-verify scraped data from multiple sources.
Automate regular updates to capture new fines.
Clean and structure the data to maintain consistency and reliability.


