Why Is Web Scraping Essential for Extracting Drug Information from WebMD?

WebMD is one of the most widely used sources of health and medical information, offering extensive resources on health, wellness, and medical conditions. The site is popular for its well-researched, expert-reviewed content and for tools that help users find reliable health information. It is a trusted source of general wellness advice and of in-depth explanations of symptoms, treatments, and medical procedures. Many people turn to it to learn about medical issues, and healthcare professionals use it to stay current with medical information and insights.


Collecting this information at scale by hand is impractical, which is where web scraping comes into play. If you're new to it, check out our guide to web scraping with Python to understand the basics.


Web scraping is the automated collection of information from websites using tools and scripts that behave like a person browsing the Internet. It lets you gather large amounts of information from web pages without copying and pasting everything manually. Web scraping can collect both structured and unstructured data, such as product details, prices, reviews, or free text, which can then be used for analysis, reporting, or integration into other systems. Many industries, including e-commerce, data analytics, and research, use it to make sense of online data. However, scraping must respect the rules and legal guidelines of the site whose content is being collected.


Choosing the right web scraping service is essential to ensure accuracy and compliance when dealing with healthcare data.


In this project, web scraping is split into two steps to keep the collection of drug information from WebMD efficient and accurate. The first step is collecting product URLs from WebMD's drug index. This starts with the alphabetical index, where each letter (A-Z) and a numeric category each open a page listing various drugs. For each of these pages, the scraper extracts the URLs that link to individual drug pages, building a complete database of product URLs ready for the next stage of data gathering.


SQLite is used as the storage layer: it keeps track of the product links and records a processing status for each one. This systematic approach makes it possible to handle a large amount of data and to revisit unprocessed links whenever required.


Once the URLs are gathered, the next step is to extract detailed information from each product page: drug names, generic names, uses, side effects, warnings, precautions, interactions, and overdose information. These fields are scraped from each product page and saved in a structured format in the SQLite database, so the collected data stays well organized for later use. Any page that cannot be scraped is recorded in a separate table so it can be retried or investigated later, ensuring no data is lost. This two-step approach is crucial for collecting a large amount of well-organized, detailed data from a complex website such as WebMD, and it provides both speed and a robust way to handle problems that arise during scraping.


An Overview of Libraries for Seamless Data Extraction


The WebMD web scraping project uses several Python libraries to gather the product links and extract the final data from the scraped pages. Here is a breakdown of the libraries used in each step, i.e., scraping the product links and then collecting the data.


Requests


Requests is one of the most widely used libraries for making HTTP requests in Python. Both parts of the code use it to send GET requests to the WebMD website. Sending the right headers and cookies with each request simulates a browser, helping the scraper fetch the pages it needs without being blocked by the site's security.
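
As a quick illustration of how Requests is used here, the minimal sketch below fetches a single drug index page with a browser-style User-Agent; the full header and cookie setup used in this project appears later in the code.

import requests

# Minimal sketch: fetch one WebMD drug index page with a
# browser-like User-Agent header.
url = "https://www.webmd.com/drugs/2/alpha/a"
headers = {
    'user-agent': (
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36'
    )
}

response = requests.get(url, headers=headers, timeout=30)
print(response.status_code)            # 200 when the page is served
print(len(response.text), "characters of HTML")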


BeautifulSoup


BeautifulSoup, from the bs4 library, parses HTML documents so the needed data can be extracted. Once requests has fetched a webpage, BeautifulSoup takes over and parses its HTML content, allowing the script to search for and fetch specific elements such as anchor tags or div containers. In the product link scraping code, it identifies the section carrying product links; in the final data scraping phase, it captures detailed information such as drug names, uses, side effects, and more from every product page.
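
To show the parsing step in isolation, here is a small, self-contained sketch that parses an HTML snippet shaped like WebMD's drug list container (the class name matches the one targeted later; the example hrefs are made up). The real scraper applies the same find/find_all calls to full pages.

from bs4 import BeautifulSoup

# Self-contained sketch: parse a small HTML snippet and pull out the
# anchor tags, mirroring what the link scraper does on real pages.
html = """
<div class="drugs-search-list-conditions">
    <a href="/drugs/2/drug-1/example-one">Example One</a>
    <a href="/drugs/2/drug-2/example-two">Example Two</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
container = soup.find('div', class_='drugs-search-list-conditions')
for anchor in container.find_all('a'):
    print(anchor['href'], '-', anchor.get_text(strip=True))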

SQLite3


SQLite3 is Python's built-in interface to SQLite databases. It provides an easy, lightweight way to store scraped data. During product link scraping, it creates a database called webmd_webscraping.db with a table called product_links where the product links are kept. In the final data scraping step, the same database is extended: a new table called final_product_data holds the detailed drug information, and another table named failed_urls logs any links that fail to load. This keeps the scraped data stored efficiently and reliably throughout the process.
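
As a rough sketch of this storage pattern (not the exact project code, which appears below), the snippet creates the database and the product_links table and inserts one placeholder link, relying on the UNIQUE constraint to skip duplicates.

import sqlite3

# Sketch of the storage pattern: one database file, a links table, and
# an insert that the UNIQUE constraint protects against duplicates.
conn = sqlite3.connect('webmd_webscraping.db')
cursor = conn.cursor()

cursor.execute('''
    CREATE TABLE IF NOT EXISTS product_links (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        product_link TEXT UNIQUE,
        status INTEGER DEFAULT 0
    )
''')

# INSERT OR IGNORE quietly skips links that are already stored
# (the example URL is a placeholder, not a real WebMD link).
cursor.execute(
    'INSERT OR IGNORE INTO product_links (product_link) VALUES (?)',
    ('https://www.webmd.com/drugs/2/drug-1/example-one',),
)
conn.commit()
conn.close()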

Time


Time is used to pause between web requests, ensuring the scraper does not fire off too many requests in a short period, which could look suspicious and lead to blocking. The product link scraping code adds a delay between requests with time.sleep(2) so as not to overwhelm the WebMD servers. The final data scraping code adds a random sleep time between requests, which looks more like human browsing and helps avoid detection.

Random


The final data scraping script uses the random module to pick a user agent from a list at random. By rotating user agents, each request the scraper sends appears to come from a different browser or device. This makes it harder for the website to identify the scraper as a bot and adds another layer of protection against IP bans or request-rate limits.
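
A short sketch of how random and time work together in the final script is shown below; the two user agent strings are illustrative, since the project loads its list from a text file.

import random
import time

# Illustrative user agents; the project reads its list from a file.
user_agents = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
]

headers = {'user-agent': random.choice(user_agents)}  # a fresh identity per request
time.sleep(random.uniform(2, 5))                      # human-like pause between requests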

Together, these libraries make a powerful, flexible web scraping tool: it collects product data from WebMD, organizes the collected data in SQLite, and mimics a real person by rotating user agents and adding time delays.


STEP 1 : Product link scraping

Importing Libraries

import time
import requests
from bs4 import BeautifulSoup
import sqlite3

This code imports the essential libraries for web scraping and data handling: requests for making HTTP requests, BeautifulSoup (from bs4) for parsing HTML, time for adding delays between requests, and sqlite3 for database interaction.


Defining the Base URL

# Base URL for WebMD
BASE_URL = "https://www.webmd.com"

BASE_URL stores the web address of the root website. Here it points to "https://www.webmd.com", the core site from which scraping begins. Storing the base URL separately lets you append different paths to build full URLs for different pages, which keeps the code flexible and easy to maintain if the base URL ever changes.
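
For example, the drug index page for the letter 'a' lives under /drugs/2/alpha/a, so the full URL is built by appending that path to BASE_URL:

# Appending a path to the base URL to form a full page address
alpha_url = f"{BASE_URL}/drugs/2/alpha/a"
print(alpha_url)  # https://www.webmd.com/drugs/2/alpha/a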


Configuring HTTP Headers and Cookies for Requests

# Headers and cookies for HTTP requests
HEADERS = {
    'authority': 'www.googletagservices.com',
    'method': 'GET',
    'scheme': 'https',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br, zstd',
    'accept-language': 'en-IN,en-GB;q=0.9,en-US;q=0.8,en;q=0.7,ml;q=0.6',
    'cache-control': 'no-cache',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
}

COOKIES = {
    'name': 'VisitorId', 
    'value': '3b80b89d-7007-4be2-84fe-377d67abff64', 
    'domain': '.webmd.com', 
    'path': '/'
}

The HEADERS and COOKIES dictionaries are used to mimic a real user's browser when making HTTP requests to the WebMD website. These components help avoid being blocked by the website and ensure that responses are served correctly.

  • HEADERS: Defines the metadata sent with the request, such as user-agent (which tells the server what browser and operating system are being used) and accept-encoding (specifying the supported response encoding). This helps in simulating a legitimate browser request and managing caching.

  • COOKIES: Represents the stored data from previous interactions with the website, such as user session information. These values are important for maintaining state across multiple requests and may be required to access certain pages or resources on WebMD.

Together, the headers and cookies make the web scraping process more seamless and efficient by reducing the chances of being detected as a bot.


Fetching and Parsing Webpage Content with BeautifulSoup

def get_soup(url):
    """
    Sends an HTTP GET request to the provided URL, retrieves the HTML
    content of the page, and parses it into a BeautifulSoup object.

    Args:
        url (str): The URL of the webpage to retrieve.

    Returns:
        BeautifulSoup: Parsed HTML of the page.
    """
    response = requests.get(url, headers=HEADERS, cookies=COOKIES)
    return BeautifulSoup(response.text, 'html.parser')

The get_soup function fetches and parses webpage content using the requests and BeautifulSoup libraries. It sends an HTTP GET request to the given URL with the predefined headers and cookies to simulate a browser session, so the page can be accessed properly. The HTML content of the response is then parsed into a BeautifulSoup object, which allows navigation of the HTML structure and searching by tags, classes, or attributes. Returning a ready-to-use BeautifulSoup object makes the pages easier to work with during scraping.
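
A typical call looks like the snippet below (assuming HEADERS, COOKIES, and BASE_URL are defined as above); it fetches the index page for the letter 'a' and prints the page title as a quick sanity check.

# Example usage: fetch one index page and print its <title> tag
soup = get_soup(f"{BASE_URL}/drugs/2/alpha/a")
print(soup.title.get_text(strip=True) if soup.title else "No <title> found")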


Extracting Unique Links from a Specific Web Page Section

def scrape_links_from_list(soup):
    """
    Extracts all unique links from a specific div in the parsed HTML.

    This function looks for anchor (`<a>`) tags within a div element that 
    has the class 'drugs-search-list-conditions'. It collects the `href` 
    attributes of those anchor tags, prefixes them with the base URL, and 
    returns a set of full URLs. Using a set ensures uniqueness, so duplicate 
    links are automatically removed.

    Args:
        soup (BeautifulSoup): The parsed HTML content of a webpage, represented 
                              as a BeautifulSoup object. This object is usually 
                              created after fetching and parsing the webpage content.

    Returns:
        set: A set containing unique URLs (as strings) extracted from the specified 
             div. Each URL is prefixed with `BASE_URL` to form a complete link.
    """
    div = soup.find('div', class_='drugs-search-list-conditions')
    if div:
        return {BASE_URL + a['href'] for a in div.find_all('a')}
    return set()

scrape_links_from_list : This function finds all the unique links inside the div with class 'drugs-search-list-conditions' on an HTML page. It locates every <a> tag within that div and picks up their href attributes, which are relative paths. These relative paths are combined with the previously defined BASE_URL to build absolute URLs, and everything is returned as a set, so each link is stored only once. If a page does not contain the targeted div element, the function returns an empty set. This approach is handy when you only want to scrape meaningful links from structured sections of a webpage.


Processing Sub-Alpha Links and Storing Unique Product URLs in the Database

def process_subalpha_links(alpha_soup, cursor, conn, unique_links):
    """
    Processes sub-alpha links from the current alphabet page and extracts unique 
    product links for database insertion.

    This function handles the second layer of alphabet-based navigation. It identifies 
    'sub-alpha' links from the current page and processes each sub-alpha link to 
    extract product links. These links are checked for uniqueness using the 
    `unique_links` set, and new links are inserted into the `product_links` table.

    Args:
        alpha_soup (BeautifulSoup): Parsed HTML of the current main alphabet page.
        cursor (sqlite3.Cursor): SQLite cursor for executing SQL queries.
        conn (sqlite3.Connection): SQLite connection to commit changes to the database.
        unique_links (set): Set of unique product links to avoid duplicate entries.

    Function Workflow:
        1. Looks for the 'sub-alpha' section on the alphabet page.
        2. Fetches each sub-alpha page and extracts product links.
        3. Inserts new, non-duplicate links into the database and updates the set.
        4. Commits changes to the database after processing all links.

    """
    subalpha_ul = alpha_soup.select_one(
        '.alpha-container.subalpha-container .browse-letters.squares.sub-alpha.sub-alpha-letters'
        )
    if subalpha_ul:
        for subalpha_link in subalpha_ul.select('li.sub-alpha-square a[href]'):
            subalpha_url = BASE_URL + subalpha_link['href']
            subalpha_soup = get_soup(subalpha_url)
            final_links = scrape_links_from_list(subalpha_soup)
            
            # Update the unique links set and insert new unique 
            # links into the database
            new_links = final_links - unique_links
            unique_links.update(new_links)
            for link in new_links:
                cursor.execute(
                    "INSERT INTO product_links (product_link) VALUES (?)",
                    (link,)
                    )
        
        conn.commit()

The process_subalpha_links function handles a deeper level of navigation in the alphabet-based index. After landing on the page for 'A', 'B', and so on, it looks for subdivisions of that page. Each sub-alpha link leads to a deeper page where the target product URLs are found, and those URLs are added to the SQLite database, but only the unique ones that have not been inserted before.

The function first scans the parsed HTML for the sub-alpha section using BeautifulSoup. For every sub-alpha link, it fetches and parses the page and collects all product links. The extracted links are checked against the unique_links set to avoid adding duplicate entries to the database; every new link is inserted into the product_links table and added to the set. Changes are committed to the database at the end so the product links are saved.

This approach effectively manages the hierarchy of links and ensures that product URLs are stored without duplication, making the scraping process more reliable.


Creating and Initializing the SQLite Database for Product Links

def create_database():
    """
    Initializes the SQLite database and creates the necessary table 
    for storing product links.

    This function checks whether the SQLite database file (`webmd_webscraping.db`) 
    exists, and if not, it creates it. Within the database, it ensures that the
    `product_links` table is present, creating it if it does not already exist.
    The `product_links` table contains the following columns:
    
    - `id`: An auto-incremented integer that serves as the primary key.
    - `product_link`: A unique text field that stores the scraped product URLs.
    - `status`: An integer field that defaults to 0, indicating whether 
                the link has been processed (e.g.,  0 for unprocessed 
                and 1 for processed).
    
    After ensuring that the table is set up, the function returns 
    the database connection and cursor  objects for use in subsequent
    database operations.

    Returns:
        tuple: A tuple containing:
            - conn (sqlite3.Connection): A SQLite database connection object, 
              which can be used to manage transactions and commit changes 
              to the database.
            - cursor (sqlite3.Cursor): A SQLite cursor object, which is used 
            to execute SQL queries.
    
    """
    conn = sqlite3.connect('webmd_webscraping.db')
    cursor = conn.cursor()
    
    cursor.execute('''
    CREATE TABLE IF NOT EXISTS product_links (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        product_link TEXT UNIQUE,
        status INTEGER DEFAULT 0
    )
    ''')
    
    conn.commit()
    return conn, cursor

The create_database function sets up the foundation for storing scraped product links by creating an SQLite database and establishing a table for managing the URLs. The function checks if the database file (webmd_webscraping.db) exists, creating it if necessary. Within the database, it ensures the existence of the product_links table, which is designed to store the URLs of product links collected during the web scraping process.

The table structure includes three columns:

  1. id - a primary key that auto-increments for each entry.

  2. product_link - a unique field that holds the actual product URLs.

  3. status - an integer field (defaulting to 0) used to indicate whether a link has been processed (e.g., 0 for unprocessed and 1 for processed).

By returning the database connection (conn) and cursor (cursor) objects, the function enables further database operations, such as inserting or querying data, in subsequent steps of the scraping workflow. Additionally, it uses the IF NOT EXISTS clause in the SQL query to avoid duplicating the table if it already exists, ensuring seamless initialization.

The function plays a critical role in organizing and tracking the status of the product links during the scraping process, allowing for efficient data management and ensuring that each link is processed only once.
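
As a small, hypothetical illustration of how the status column supports resumable scraping (this helper is not part of the project code), the snippet below lists the links that still have status 0 and marks one of them as processed.

# Hypothetical illustration of the status workflow
conn, cursor = create_database()

cursor.execute('SELECT product_link FROM product_links WHERE status = 0')
pending = [row[0] for row in cursor.fetchall()]
print(f"{len(pending)} links still to scrape")

if pending:
    cursor.execute(
        'UPDATE product_links SET status = 1 WHERE product_link = ?',
        (pending[0],),
    )
    conn.commit()

conn.close()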


Scraping WebMD Drug Links and Storing in an SQLite Database

def scrape_webmd_links():
    """
    Scrapes WebMD drug links for all alphabets and saves 
    unique links into an SQLite database.

    This function iterates over all alphabet characters (
    from 'a' to 'z' and '0' for numeric entries), constructs 
    a URL for each letter on WebMD's drug index page, and 
    retrieves the corresponding HTML content.  It parses the 
    page to find sub-alphabet links and scrapes all drug-related 
    links from these sub-pages. Unique links are then inserted 
    into the SQLite database for storage, ensuring no duplicates 
    are added.

    Workflow:
    1. The function initializes the SQLite database by calling 
       `create_database()` and prepares a set to track unique links.
    2. It iterates over each alphabet character and retrieves the 
       corresponding WebMD URL (e.g., for 'a', it scrapes 
       "https://www.webmd.com/drugs/2/alpha/a").
    3. For each letter page, it extracts relevant sub-alphabet 
       links and processes those to get the final drug links.
    4. Extracted links are stored in the `product_links` table, 
       with uniqueness enforced by both the database and the set.
    5. A short delay (`time.sleep(2)`) is added between each 
       request to avoid overwhelming the server.
    6. Once scraping is complete, the database connection is closed.

    """
    conn, cursor = create_database()
    unique_links = set()

    try:
        for char in 'abcdefghijklmnopqrstuvwxyz0':
            url = f"{BASE_URL}/drugs/2/alpha/{char}"
            alpha_soup = get_soup(url)
            
            alpha_text_div = alpha_soup.select_one(
                '.alpha-text[data-metrics-module="drugs-az"]'
                )
            if alpha_text_div:
                process_subalpha_links(
                    alpha_text_div, cursor, 
                    conn, unique_links
                    )
                
            time.sleep(2)  # Short delay between each main URL request
    
    finally:
        conn.close()

The scrape_webmd_links function does the following: it automatically scrapes drug-related links from WebMD's drug index, covering all alphabet characters from 'a' to 'z', as well as numeric entries ('0'). It systematically constructs URLs for each alphabet character, retrieves the corresponding HTML page, and extracts the sub-alphabet and drug links. These links are then stored in an SQLite database with a mechanism to ensure uniqueness through both a Python set and database constraints.


The workflow starts by invoking create_database(), which initializes the SQLite database and the product_links table. A set (unique_links) keeps track of the links already processed in this session. The function then loops over each character, fetching the corresponding page from WebMD. For each page, it processes the sub-alphabet links, retrieves the drug URLs, and inserts them into the database without duplicates.


The function avoids bombarding the server by placing a 2-second delay between requests. After all links are scraped, the SQLite connection is closed to ensure data integrity. This function handles the whole scraping and storage operation, so an exhaustive, de-duplicated list of drug URLs is available for the next step.


Entry Point for WebMD Drug Link Scraping

# Start scraping and save the links to the SQLite database
if __name__ == "__main__":
    """
    Entry point for executing the WebMD drug link scraping process.

    When this script is run directly, the `scrape_webmd_links()`
    function is invoked to start the scraping process.
    It scrapes WebMD's drug pages, extracts unique product links 
    for each alphabet character (a-z and 0), and stores them in 
    an SQLite database named 'webmd_webscraping.db'.

    The scraped links are stored in the `product_links` table with 
    two columns:
        - `product_link`: Contains the URL of the drug-related product.
        - `status`: A default status column (set to 0) for potential 
                    future use (e.g., marking links as processed).
    """
    scrape_webmd_links()

The if __name__ == "__main__": block is the entry point for the WebMD drug link scraping process. When the script is executed directly, it calls scrape_webmd_links() to start scraping WebMD's drug pages. For each alphabet character (a-z and '0'), it collects unique drug-related links and stores them in an SQLite database called webmd_webscraping.db.

The product_links table has two main columns: product_link, which contains the URL of the drug product, and status, which defaults to 0. The status column is used to track the processing state of each link in the next step of the project. To run the script, execute it from the command line with Python, making sure the required dependencies are installed (requests and beautifulsoup4; sqlite3 ships with Python). The script creates the SQLite database if it doesn't already exist and stores the scraped links efficiently.


STEP 2 : Final Data Scraping From Product Links

Importing Libraries

import requests
from bs4 import BeautifulSoup
import time
import random
import sqlite3

The requests library sends HTTP requests, BeautifulSoup parses HTML/XML, time handles delays, random generates random numbers, and sqlite3 manages SQLite database operations.


Configuration Variables

# Configuration Variables
DB_NAME = 'webmd_webscraping.db'
USER_AGENTS_FILE = 'data/user_agents.txt'
SLEEP_MIN = 2
SLEEP_MAX = 5

The Configuration Variables section defines key parameters for the script:

  1. DB_NAME: Specifies the name of the SQLite database (webmd_webscraping.db) where the scraped data will be stored. This is the database that will hold the drug information and any failed URLs during the scraping process.

  2. USER_AGENTS_FILE: Refers to the file path (data/user_agents.txt) that contains a list of user agent strings. These user agents are used to mimic different browsers when making requests to the website, helping avoid detection and blocking by the server.

  3. SLEEP_MIN and SLEEP_MAX: These variables control the random sleep interval between consecutive web scraping requests, ranging from 2 to 5 seconds. The random delay helps prevent the server from detecting the scraping activity as automated or overwhelming it with too many requests in a short time.

Together, these configuration variables ensure that the script runs efficiently, accesses the website responsibly, and stores the scraped data properly.
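
The user agent file is expected to contain one user agent string per line. If you need to create a sample data/user_agents.txt for testing, a small sketch like the one below will do; the strings are just examples.

from pathlib import Path

# Create a sample data/user_agents.txt with one user agent per line,
# which is the format the load_user_agents function below expects.
sample_agents = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
]

Path('data').mkdir(exist_ok=True)
Path('data/user_agents.txt').write_text('\n'.join(sample_agents), encoding='utf-8')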


Headers and Cookies to Mimic a Real Browser Request

# Headers and cookies to mimic a real browser request
headers = {
    'authority': 'www.googletagservices.com',
    'method': 'GET',
    'scheme': 'https',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br, zstd',
    'accept-language': 'en-IN,en-GB;q=0.9,en-US;q=0.8,en;q=0.7,ml;q=0.6',
    'cache-control': 'no-cache'
}

cookies = {
    'name': 'VisitorId',
    'value': '3b80b89d-7007-4be2-84fe-377d67abff64',
    'domain': '.webmd.com',
    'path': '/'
}

This section sets up headers and cookies that mimic a real browser's behavior during web requests. These elements are essential for avoiding detection when scraping websites like WebMD, which may block automated scripts that don’t resemble real users.

  1. Headers: The headers dictionary includes parameters like accept, accept-encoding, accept-language, and cache-control, which tell the server what kind of content to return, how it can be encoded, and which languages are acceptable. These headers simulate the browser's settings when making the request, helping the scraper appear as if it's coming from a regular browser.

  2. Cookies: The cookies dictionary includes the VisitorId, which tracks the visitor's session on the WebMD site. This cookie is passed along with each request to maintain the browsing session, which might be needed to access certain content.

By using headers and cookies, the scraper behaves more like a legitimate user, increasing the likelihood of successful data retrieval without being blocked by the website’s security measures.


Database Setup and Table Creation

# Database setup
conn = sqlite3.connect(DB_NAME)
cursor = conn.cursor()

# Create tables if not exist
cursor.execute('''
    CREATE TABLE IF NOT EXISTS final_product_data (
        id INTEGER PRIMARY KEY,
        product_link TEXT,
        drug_name TEXT,
        generic_name TEXT,
        uses TEXT,
        side_effects TEXT,
        warnings TEXT,
        precautions TEXT,
        interactions TEXT,
        overdose TEXT
    )
''')

cursor.execute('''
    CREATE TABLE IF NOT EXISTS failed_urls (
        id INTEGER PRIMARY KEY,
        url TEXT
    )
''')

conn.commit()

This section connects to the SQLite database and ensures that the tables for storing scraped data and failed URLs exist. The script connects using sqlite3.connect(DB_NAME), where DB_NAME is the database file webmd_webscraping.db; if the file does not exist, it is created automatically. A cursor object is then initialized for executing SQL commands. Two tables are created if they do not already exist: final_product_data and failed_urls. The final_product_data table holds all the relevant information about a drug, including the product link, drug name, generic name, uses, side effects, warnings, precautions, interactions, and overdose details. If any URL fails to scrape, it is logged in the failed_urls table, making it easy to track and retry later. Once these commands run, conn.commit() saves the changes, ensuring the database structure is in place before the scraping phase begins.


Loading User Agents from File

# Functions
def load_user_agents(user_agents_file):
    """
    Loads a list of user agents from a specified text file.

    Args:
        user_agents_file (str): The path to the text file containing user agent strings, 
                                with each user agent on a new line.

    Returns:
        list: A list of user agent strings read from the file, with whitespace stripped.
    """
    with open(user_agents_file, mode='r', encoding='utf-8') as file:
        return [line.strip() for line in file if line.strip()]

The load_user_agents function reads user agent strings from a text file; these are later used to mimic browser requests during web scraping. A user agent is an identifier a website uses to work out what kind of device or browser a request is coming from. The function takes one argument, the path to a text file containing multiple user agents, one per line. The file is opened for reading with UTF-8 encoding to support all languages and characters, and the function iterates through every line, stripping leading and trailing whitespace and skipping empty lines, before returning the resulting list. A random entry from this list is later picked for every scraping request, which helps avoid detection or blocking by the target website.


Selecting a Random User Agent

def get_random_user_agent():
    """
    Selects and returns a random user agent string from the pre-loaded list of user 
    agents.

    Returns:
        str: A randomly selected user agent string.
    """
    return random.choice(user_agents)

The get_random_user_agent function is responsible for selecting a random user agent from a pre-loaded list of user agents. This list, generated by the previous function, contains various user agent strings that mimic different browsers or devices. By selecting a random user agent for each request, this function helps avoid detection by websites that block or throttle repeated requests from the same user agent. It returns a single, randomly chosen user agent string, which will be used to simulate a browser during web scraping activities. This strategy helps make the web scraper's behavior appear more natural and less likely to trigger anti-bot measures.


Retrieving and Parsing Web Page Content

def get_soup(url):
    """
    Makes an HTTP GET request to the specified URL and parses the HTML content using 
    BeautifulSoup.

    Args:
        url (str): The URL to retrieve and parse.

    Returns:
        BeautifulSoup: Parsed HTML content of the requested page.
    """
    headers['user-agent'] = get_random_user_agent()
    response = requests.get(url, headers=headers, cookies=cookies)
    return BeautifulSoup(response.text, 'html.parser')

The get_soup function fetches the HTML of a given URL and parses it with BeautifulSoup. It starts by setting the 'user-agent' header to a random user agent chosen by get_random_user_agent, to simulate a real browser request. It then makes an HTTP GET request to the specified URL, passing in the headers and cookies, and parses the response with BeautifulSoup, returning a BeautifulSoup object that can be used to extract data from the page.


Extracting Text from HTML Div Elements

def extract_text_from_div(soup, selectors):
    """
    Extracts text content from a specified HTML div element based on a list of CSS  
    selectors.

    Args:
        soup (BeautifulSoup): The parsed HTML content from which to extract text.
        selectors (list): A list of CSS selectors to identify the target div elements.

    Returns:
        str: The extracted text content from the first matching div. If no matching div is 
             found , returns "Not Available".
    """
    for selector in selectors:
        target_div = soup.select_one(selector)
        if target_div:
            return target_div.get_text(separator='\n', strip=True)
    return "Not Available"

The extract_text_from_div function extracts text content from particular HTML <div> elements in the parsed document. It accepts two parameters: soup, a BeautifulSoup object holding the parsed HTML, and selectors, a list of CSS selectors targeting the desired <div> elements. The function iterates over the selectors and calls select_one on each to find the first matching element. If a matching <div> is found, its text content is returned with newline separators between lines and leading/trailing whitespace stripped. If no selector matches, the function returns "Not Available", so it is explicit when a section could not be found. This makes it easy to pull specific sections out of pages whose HTML structure varies.
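
A quick, made-up example of the fallback behaviour: the first selector misses because the sample div lacks the center-content class, the second selector matches, and a section that is absent altogether comes back as "Not Available".

from bs4 import BeautifulSoup

# Made-up snippet to demonstrate the selector fallback and the default value
html = '<div class="uses-container"><p>Take with food.</p></div>'
sample_soup = BeautifulSoup(html, 'html.parser')

print(extract_text_from_div(sample_soup, ['.uses-container.center-content', '.uses-container']))
# -> Take with food.
print(extract_text_from_div(sample_soup, ['.overdose-container.center-content', '.overdose-container']))
# -> Not Available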


Extracting the Generic Name of a Drug

def extract_generic_name(soup):
    """
    Extracts the generic name of a drug from the provided parsed HTML content.

    Args:
        soup (BeautifulSoup): The parsed HTML content of the drug's page.

    Returns:
        str: The extracted generic name of the drug. If the generic name is not found,
             returns "Not Available".
    """
    generic_name = soup.find('h3', class_='drug-generic-name')
    if generic_name:
        return generic_name.get_text(strip=True).split(':')[-1].strip()
    
    drug_info_holder = soup.find('div', class_='drug-info-holder')
    if drug_info_holder:
        generic_name_li = drug_info_holder.find('li', class_='generic-name')
        if generic_name_li:
            generic_name_span = generic_name_li.find('span')
            if generic_name_span:
                return generic_name_span.get_text(strip=True)
    return "Not Available"

The extract_generic_name function fetches a drug's generic name from the parsed HTML of its page. It takes one argument, soup, a BeautifulSoup object containing the parsed HTML. It first looks for an <h3> element with the class drug-generic-name; if present, the function takes the text after the colon (:) and returns it with whitespace stripped. If that <h3> does not exist, the function looks for a <div> with class drug-info-holder, then for a <li> with class generic-name inside it, and finally for a <span> inside that list item, from which it recovers the generic name text. If none of these are found, the function returns "Not Available", making it clear that the generic name could not be recovered. In this way it handles the different HTML structures used across drug pages.


Scraping Drug Information from a Web Page

def scrape_drug_info(url):
    """
    Scrapes detailed drug information from the specified URL.

    Args:
        url (str): The URL of the drug page to scrape.

    Returns:
        dict: A dictionary containing the scraped information, including:
            - 'product_link' (str): The URL of the drug.
            - 'drug_name' (str): The name of the drug. Returns "Not Available" if not 
                                 found.
            - 'generic_name' (str): The generic name of the drug. Returns "Not Available" 
                                    if not found.
            - 'uses' (str): The uses of the drug. Returns "Not Available" if not found.
            - 'side_effects' (str): The side effects of the drug. Returns "Not Available" 
                                    if not found.
            - 'warnings' (str): The warnings associated with the drug. Returns "Not 
                                Available" if not found.
            - 'precautions' (str): The precautions for using the drug. Returns "Not 
                                   Available" if not found.
            - 'interactions' (str): The interactions with other drugs. Returns "Not 
                                    Available" if not found.
            - 'overdose' (str): Information regarding overdose. Returns "Not Available" if 
                                not found.
    """
    soup = get_soup(url)
    
    drug_name = soup.find('h1', class_='drug-name')
    drug_name = drug_name.get_text(strip=True) if drug_name else "Not Available"
    
    generic_name = extract_generic_name(soup)
    
    selectors = {
        'uses': ['.uses-container.center-content', '.uses-container'],
        'side_effects': ['.side-effects-container.center-content', '.sideeffects-container'],
        'warnings': ['.warnings-container.center-content', '.warnings-container'],
        'precautions': ['.precautions-container.center-content', '.precautions-container'],
        'interactions': ['.interactions-container.center-content', '.interactions-container'],
        'overdose': ['.overdose-container.center-content', '.overdose-container']
    }
    
    extracted_info = {key: extract_text_from_div(soup, sel) for key, sel in selectors.items()}
    
    return {
        'product_link': url,
        'drug_name': drug_name,
        'generic_name': generic_name,
        'uses': extracted_info['uses'],
        'side_effects': extracted_info['side_effects'],
        'warnings': extracted_info['warnings'],
        'precautions': extracted_info['precautions'],
        'interactions': extracted_info['interactions'],
        'overdose': extracted_info['overdose']
    }

The scrape_drug_info function gathers detailed drug information from a given product page URL. It takes one argument, url, the address of the page containing the drug information. The function first calls get_soup to fetch and parse the page's HTML. It then tries to find the drug name in an <h1> element with class drug-name, returning its text with whitespace stripped, or "Not Available" if the element is missing. Next, it calls extract_generic_name to obtain the drug's generic name.


The function then defines a selectors dictionary for the remaining sections: uses, side effects, warnings, precautions, interactions, and overdose. Each key maps to a list of CSS selectors for the relevant HTML containers, including fallbacks for pages with slightly different layouts. A dictionary comprehension applies extract_text_from_div to each selector list, producing the text for every section.


Finally, the function returns a dictionary containing the product link, drug name, generic name, and the extracted sections (uses, side effects, warnings, precautions, interactions, and overdose). Any missing piece of information comes back as "Not Available", so gaps in the data are clearly flagged. This design extracts data reliably across the different page layouts WebMD uses.
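
A call might look like the snippet below, assuming the user agent list has already been loaded; the URL is a placeholder rather than a verified WebMD link.

# Example usage with a placeholder URL (not a verified WebMD link)
sample_url = "https://www.webmd.com/drugs/2/drug-00000/example-drug"
drug = scrape_drug_info(sample_url)
print(drug['drug_name'])
print(drug['uses'][:200])  # first 200 characters of the uses section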


Scraping Drug Data from the Database

def scrape_drug_data_from_database():
    """
    Scrapes drug information from URLs stored in the database and saves the results.

    This function retrieves URLs from the 'product_links' table with a status of 0 
    (indicating that they have not been scraped yet). For each URL, it calls the 
    `scrape_drug_info` function to obtain drug information. If the drug name is not 
     available, the URL is recorded in the 'failed_urls' table. Otherwise, the scraped 
     data is inserted into the 'final_product_data' table.
     After successfully scraping a URL, its status is updated to 1 to indicate it has been 
     processed.

    The function handles exceptions by logging errors and saving failed URLs to the 
    'failed_urls' table. It also introduces a random sleep interval between requests to 
     avoid overwhelming the server.

    Returns:
        None
    """
    cursor.execute('SELECT product_link FROM product_links WHERE status = 0')
    urls = cursor.fetchall()

    for url, in urls:
        try:
            drug_data = scrape_drug_info(url)
            
            if drug_data['drug_name'] == "Not Available":
                cursor.execute('INSERT INTO failed_urls (url) VALUES (?)', (url,))
            else:
                cursor.execute('''
                    INSERT INTO final_product_data (product_link, drug_name, generic_name, uses, side_effects,
                    warnings, precautions, interactions, overdose)
                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
                ''', (
                    drug_data['product_link'], drug_data['drug_name'], drug_data['generic_name'],
                    drug_data['uses'], drug_data['side_effects'], drug_data['warnings'],
                    drug_data['precautions'], drug_data['interactions'], drug_data['overdose']
                ))
            
            # Update status to 1 (scraped)
            cursor.execute('UPDATE product_links SET status = 1 WHERE product_link = ?', (url,))
            conn.commit()
            
        except Exception as e:
            print(f"Error processing URL {url}: {e}")
            cursor.execute('INSERT INTO failed_urls (url) VALUES (?)', (url,))
            conn.commit()
        
        time.sleep(random.uniform(SLEEP_MIN, SLEEP_MAX))

# Load user agents and start scraping
user_agents = load_user_agents(USER_AGENTS_FILE)
scrape_drug_data_from_database()

# Close database connection
conn.close()

The scrape_drug_data_from_database function retrieves drug information from the URLs stored in the database. It queries the product_links table for rows whose status is 0, meaning they have not been scraped yet, and then iterates over the fetched URLs, calling scrape_drug_info for each one.


If a drug name was identified, the scraped data is inserted into the final_product_data table; if the drug name is "Not Available", the URL is logged in failed_urls so it can be retried later. After a URL is processed, its record in product_links is updated from status 0 to status 1, so the same URL is never scraped twice.


Error handling is another essential feature of the function. Any exception raised while scraping a URL is caught, an error message is printed, and the problematic URL is added to the failed_urls table. To avoid overloading the server, a random pause between SLEEP_MIN and SLEEP_MAX seconds is added between requests using time.sleep. Once scraping is complete, conn.close() closes the database connection and frees its resources. The function therefore balances efficiency with robust error handling while gradually collecting this valuable drug information.
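
If you later want to retry the URLs logged in failed_urls, a follow-up pass along the lines of the hypothetical sketch below could reuse scrape_drug_info; it is not part of the original script and assumes the step 2 functions and globals (headers, cookies, user_agents, DB_NAME, SLEEP_MIN, SLEEP_MAX) are still available.

import sqlite3

# Hypothetical retry pass over the failed_urls table (not in the original script)
retry_conn = sqlite3.connect(DB_NAME)
retry_cursor = retry_conn.cursor()

retry_cursor.execute('SELECT id, url FROM failed_urls')
for row_id, url in retry_cursor.fetchall():
    try:
        drug_data = scrape_drug_info(url)
        if drug_data['drug_name'] != "Not Available":
            retry_cursor.execute('''
                INSERT INTO final_product_data (product_link, drug_name, generic_name, uses,
                side_effects, warnings, precautions, interactions, overdose)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
            ''', (
                drug_data['product_link'], drug_data['drug_name'], drug_data['generic_name'],
                drug_data['uses'], drug_data['side_effects'], drug_data['warnings'],
                drug_data['precautions'], drug_data['interactions'], drug_data['overdose']
            ))
            # Drop the entry from failed_urls once it has been recovered
            retry_cursor.execute('DELETE FROM failed_urls WHERE id = ?', (row_id,))
            retry_conn.commit()
    except Exception as exc:
        print(f"Retry failed for {url}: {exc}")
    time.sleep(random.uniform(SLEEP_MIN, SLEEP_MAX))

retry_conn.close()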


Libraries and Versions


This code relies on a few key libraries for web scraping and data processing. The versions used in this project are BeautifulSoup4 (v4.12.3) for parsing HTML content and Requests (v2.32.3) for making HTTP requests. These versions ensure smooth integration and functionality throughout the scraping workflow.


Conclusion


This WebMD web scraping project demonstrates an organized way of extracting data from WebMD while maintaining ethical scraping principles. It combines structured workflows, intelligent request handling, and database integration to enable effective and reliable data collection. Using SQLite as the storage mechanism allows easy retrieval and analysis, and logging failed URLs so they can be re-scraped helps maximize data completeness.


This project is useful in automating the retrieval of medical data and offering insights into how the data can be further analyzed or integrated into health care applications.


AUTHOR


I’m Ambily, Data Analyst at Datahut. I specialize in building automated data pipelines that convert scattered medical and pharmaceutical information into structured, usable formats—empowering healthcare researchers, analysts, and data-driven platforms.


At Datahut, we’ve helped businesses across industries harness the power of web scraping to streamline research, monitor updates, and make informed decisions. In this blog, I’ll explain why web scraping is essential for extracting accurate and timely drug information from WebMD.


If you're looking to build a custom scraping solution for healthcare data, feel free to reach out using the chat widget on the right. We’re here to help you unlock the full potential of structured medical insights.

Do you want to offload the dull, complex, and labour-intensive web scraping task to an expert?
