Ashwin Joseph

A Guide to Scrape Indeed using Selenium and BeautifulSoup


Indeed is a well-known job search website that has gained worldwide popularity due to its extensive job listings. Job seekers and employers both benefit from its resources, which allow users to search for, apply to, and compare job opportunities. The platform is a top choice among job search engines and contains a wealth of job postings and related information.


Examining job postings and associated information from Indeed using web scraping can provide valuable insights into the current job market. This information can help job seekers and employers keep up with evolving job trends, in-demand skills, and expected salaries.


In this blog post, we'll be discussing how to scrape job data from the Indeed website. We'll randomly select job positions and locations, scrape Indeed for each of these combinations to obtain valuable information, and then perform a visual analysis.


Target Jobs and Locations

To begin our scraping process, we need to first identify the job positions and locations we want to target. Here, we have selected six positions and three locations as targets and scraped data for every combination of the two.


Jobs

  • Data Scientist

  • Business Analyst

  • Data Engineer

  • Python Developer

  • Full Stack Developer

  • Machine Learning Engineer

Locations

  • New York

  • Los Angeles

  • California


The Attributes

For each job posting, the following attributes are extracted:

  1. Job link: It is the unique address of the job posting on the Indeed website.

  2. Job title: It specifies the job role or position of the employee within the organization.

  3. Company name: It is the name of the company which has posted the job opening.

  4. Company location: It specifies the place where the company is situated.

  5. Salary: It specifies the average annual salary provided for the job.

  6. Job type: It specifies the employment status, such as full-time or part-time.

  7. Rating: It is the rating of the company provided by its employees.

  8. Job description: It is a short snippet describing the job and its requirements.
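
Put together, each posting eventually becomes one record with these eight fields. The values below are purely illustrative, hypothetical examples, not data taken from Indeed:

# purely illustrative example of one extracted record (hypothetical values)
sample_job = {
    'job_link': 'https://www.indeed.com/rc/clk?jk=...',
    'job_title': 'Data Scientist',
    'company_name': 'Example Corp',
    'company_location': 'New York, NY',
    'salary': '$120,000 - $150,000 a year',
    'job_type': 'Full-time',
    'rating': '4.1',
    'job_description': 'Build and deploy machine learning models ...'
}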

Required Libraries

After we have identified the job positions, locations, and attributes to be scraped, the next step is to import the required libraries. Here, we will be scraping Indeed using Selenium, which is a tool used to automate web browsers. The libraries which need to be imported are:

  • Selenium web driver is a tool used for web automation. It allows a user to automate web browser actions such as clicking a button, filling in fields, and navigating to different websites.

  • ChromeDriverManager is a library that simplifies the process of downloading and installing the Chrome driver, which Selenium requires to control the Chrome web browser.

  • BeautifulSoup is a Python library that is used for parsing and pulling data out of HTML and XML files.

  • The lxml library of Python is used for the processing of HTML and XML files. An ElementTree or etree is a module in lxml used to parse XML documents.

  • The csv library is used to read and write tabular data in CSV format.

  • The time library provides time-related functions; here it is used to pause execution between requests.

# import necessary modules
from bs4 import BeautifulSoup
from lxml import etree as et
from csv import writer
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time

Scraping Process

After importing the required libraries, the next step is to initialize 2 lists. The first list will be initialized with the identified job positions, and the second list will be initialized with the identified job locations. Later in the scraping process, each job position will be combined with each job location, and we will search the Indeed website for each combination.


For each combination, a search results page contains 15 job postings. We will be scraping 10 search result pages for each combination, and each page has a different URL. Hard-coding the URL of every page in our program is not a feasible approach. Therefore, we use a single template URL into which the values of each combination are inserted, and scrape the job details from each resulting page. This template is initialized and assigned to a variable named pagination_url. Besides this, we have another variable named base_url. The scraped job links are relative paths and therefore not valid on their own; they need to be appended to base_url in order to form a valid URL.


# define job and location search keywords
job_search_keyword = ['Data+Scientist', 'Business+Analyst', 'Data+Engineer', 'Python+Developer', 'Full+Stack+Developer', 'Machine+Learning+Engineer']
location_search_keyword = ['New+York', 'California', 'Los+Angeles']

# define base and pagination URLs
base_url = 'https://www.indeed.com'
pagination_url = "https://www.indeed.com/jobs?q={}&l={}&radius=35&start={}"
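
For illustration, formatting this template with one position, one location, and a start offset of 10 produces the URL of the second results page (the values below are just an example):

# example only: build the URL for one position/location combination
example_url = pagination_url.format('Data+Scientist', 'New+York', 10)
print(example_url)
# https://www.indeed.com/jobs?q=Data+Scientist&l=New+York&radius=35&start=10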

Now, to start the scraping process, we need to open the web browser so that Selenium can interact with it. For this, we create an instance of the Chrome web driver, using ChromeDriverManager to download and install the driver binary. This instance is assigned to a variable named driver. Once the web driver is ready, the get() method is used to open the Indeed website in the Chrome web browser.


# initialize Chrome webdriver using ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
# open initial URL
driver.get("https://www.indeed.com/q-USA-jobs.html?vjk=823cd7ee3c203ac3")
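
Note that the constructor above follows the older Selenium 3 style. If you are running Selenium 4 or later, the driver path is passed through a Service object instead; a minimal sketch, assuming the same webdriver_manager package, is shown below.

# Selenium 4 style (sketch): pass the driver path through a Service object
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))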

Now that the Indeed website is opened, we can start searching and scraping job postings for each combination of position and location. As mentioned earlier, each combination will have a unique URL. For each URL, we will first call the get_dom() function with the URL as the parameter. In this function, the Chrome web driver first opens the URL and retrieves the page source code using the driver.page_source attribute. It contains the HTML code of the loaded page and is stored in a variable named page_content. Then, we create a BeautifulSoup object called product_soup by parsing the page source with the 'html.parser' parser, convert it to an ElementTree object using the et.HTML() method, and return the resulting DOM tree. This DOM is a hierarchical representation of the HTML structure of the page.


# function to get DOM from given URL
def get_dom(url):
   driver.get(url)
   page_content = driver.page_source
   product_soup = BeautifulSoup(page_content, 'html.parser')
   dom = et.HTML(str(product_soup))
   return dom
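
As a quick sanity check (not part of the original script), the function can be tried on the first results page of one combination; the class name used for the job cards here is the same one the script relies on later.

# quick sanity check: fetch one results page and count the job cards on it
test_dom = get_dom(pagination_url.format('Data+Scientist', 'New+York', 0))
print(len(test_dom.xpath('//div[@class="job_seen_beacon"]')))  # typically up to 15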

Extraction Process


Each search result page contains 15 job postings. Each job posting is a div element with a common class name. After extracting all the job posting elements, we will extract the attributes required by us for each job posting one by one. For that purpose, we have defined the following functions:


# function to extract the job link
def get_job_link(job):
   try:
       job_link = job.xpath('./descendant::h2/a/@href')[0]
   except Exception as e:
       job_link = 'Not available'
   return job_link


# function to extract the job title
def get_job_title(job):
   try:
       job_title = job.xpath('./descendant::h2/a/span/text()')[0]
   except Exception as e:
       job_title = 'Not available'
   return job_title


# function to extract the company name
def get_company_name(job):
   try:
       company_name = job.xpath('./descendant::span[@class="companyName"]/text()')[0]
   except Exception as e:
       company_name = 'Not available'
   return company_name


# function to extract the company location
def get_company_location(job):
   try:
       company_location = job.xpath('./descendant::div[@class="companyLocation"]/text()')[0]
   except Exception as e:
       company_location = 'Not available'
   return company_location


# function to extract salary information
def get_salary(job):
   try:
       salary = job.xpath('./descendant::span[@class="estimated-salary"]/span/text()')
   except Exception as e:
       salary = 'Not available'
   if len(salary) == 0:
       try:
           salary = job.xpath('./descendant::div[@class="metadata salary-snippet-container"]/div/text()')[0]
       except Exception as e:
           salary = 'Not available'
   else:
       salary = salary[0]
   return salary


# function to extract the job type
def get_job_type(job):
   try:
       job_type = job.xpath('./descendant::div[@class="metadata"]/div/text()')[0]
   except Exception as e:
       job_type = 'Not available'
   return job_type


# function to extract the company rating
def get_rating(job):
   try:
       rating = job.xpath('./descendant::span[@class="ratingNumber"]/span/text()')[0]
   except Exception as e:
       rating = 'Not available'
   return rating


# function to extract the job description
def get_job_desc(job):
   try:
       job_desc = job.xpath('./descendant::div[@class="job-snippet"]/ul/li/text()')
   except Exception as e:
       job_desc = ['Not available']
   if job_desc:
       job_desc = ",".join(job_desc)
   else:
       job_desc = 'Not available'
   return job_desc

Here, each function retrieves one attribute and returns it. The retrieval is wrapped in a try block because an attribute may be missing from a particular posting; any error that occurs is handled in the except block by returning a default value.
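
Since the same try/except pattern repeats in every function above, it could also be factored into a single helper. The sketch below is a hypothetical refactor rather than part of the original script:

# hypothetical helper: run an XPath query on a job card and fall back to a default
def safe_xpath(job, xpath, default='Not available'):
    try:
        result = job.xpath(xpath)
        return result[0] if result else default
    except Exception:
        return default

# e.g. job_title = safe_xpath(job, './descendant::h2/a/span/text()')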


Writing to a CSV File


Extracting the data is not enough. We need to store it somewhere so that we can use it for other purposes like analysis. Now we will see how to store the extracted data in a CSV file.


The following code opens a CSV file named “indeed_jobs.csv” in write mode. Then we initialize a writer object named theWriter and write the column names to the CSV file using the writerow() function. Then, using nested loops, we start scraping each job position: the first loop selects a job position, the second loop selects a location, and the third loop iterates over a range of values from 0 to 100 with an increment of 10. These values are inserted into pagination_url, which forms the different page URLs.


For each URL under a combination of position and location, we call the get_dom() function, extract the job elements, and store them in a list named all_jobs. Now the all_jobs list contains all the job openings for that particular combination. Next, we iterate through this list, extract the required attributes for each job opening, and write them to the CSV file. This step is repeated for each combination of position and location. After extracting each attribute, we call the sleep() function of the time library, which pauses the program for a few seconds. This is a way to avoid getting blocked during scraping. After the whole process has been completed, we call driver.quit(), which closes the web browser that was opened by the Selenium web driver.

# Open a CSV file to write the job listings data
with open('indeed_jobs.csv', 'w', newline='', encoding='utf-8') as f:
   theWriter = writer(f)
   heading = ['job_link', 'job_title', 'company_name', 'company_location', 'salary', 'job_type', 'rating', 'job_description', 'searched_job', 'searched_location']
   theWriter.writerow(heading)
   for job_keyword in job_search_keyword:
       for location_keyword in location_search_keyword:
           all_jobs = []
           for page_no in range(0, 100, 10):
               url = pagination_url.format(job_keyword, location_keyword, page_no)
               page_dom = get_dom(url)
               jobs = page_dom.xpath('//div[@class="job_seen_beacon"]')
               all_jobs = all_jobs + jobs
           for job in all_jobs:
               job_link = base_url + get_job_link(job)
               time.sleep(2)
               job_title = get_job_title(job)
               time.sleep(2)
               company_name = get_company_name(job)
               time.sleep(2)
               company_location = get_company_location(job)
               time.sleep(2)
               salary = get_salary(job)
               time.sleep(2)
               job_type = get_job_type(job)
               time.sleep(2)
               rating = get_rating(job)
               time.sleep(2)
               job_desc = get_job_desc(job)
               time.sleep(2)
               record = [job_link, job_title, company_name, company_location, salary, job_type, rating, job_desc, job_keyword, location_keyword]
               theWriter.writerow(record)

# Closing the web browser
driver.quit()
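
As a side note, fixed two-second pauses are easy for rate limiters to recognize. Randomizing the delay is a common variation; the sketch below uses Python's random module and is an optional tweak, not something the original script does.

# optional tweak: pause for a random interval instead of a fixed two seconds
import random

def polite_pause(min_seconds=1.0, max_seconds=4.0):
    time.sleep(random.uniform(min_seconds, max_seconds))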

Data Analysis & Visualization


After extracting the data, an analysis is conducted on it to get valuable insights. The results of this analysis can benefit both employers and job seekers, since decisions about roles, skills, and salaries can be grounded in it.


Here, we have conducted some basic analyses of the extracted data, which are presented and visualized below:
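
The charts below were produced separately, but a similar breakdown can be reproduced from the scraped CSV. The sketch below uses pandas and a deliberately crude salary parser; it assumes the salary column contains dollar figures such as '$120,000 - $150,000 a year' and is only meant as an illustration.

# rough sketch: reproduce the breakdowns from the scraped CSV with pandas
import re
import pandas as pd

df = pd.read_csv('indeed_jobs.csv')

# total job postings per searched location
print(df['searched_location'].value_counts())

# crude salary parsing: average any dollar figures found in the salary string
def parse_salary(text):
    figures = [float(x.replace(',', '')) for x in re.findall(r'\$([\d,]+(?:\.\d+)?)', str(text))]
    return sum(figures) / len(figures) if figures else None

df['salary_value'] = df['salary'].apply(parse_salary)

# average salary per searched job, and per job/location pair
print(df.groupby('searched_job')['salary_value'].mean())
print(df.groupby(['searched_job', 'searched_location'])['salary_value'].mean())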


Total Job Vacancies in each Location

Our data consists of job postings from 3 different locations. We extracted a total of 2663 job postings. Among these, most of the job postings are from the New York region with a total of 900 job vacancies, followed by California and Los Angeles. The same is visualized below:

(Chart: total job vacancies in each location)

Average Salary of Each Job Position

We collected data for 6 different job positions. The average salary of each position is visualized below:


(Chart: average salary of each job position)

From the above graph, we can conclude that Machine Learning Engineers have the highest average salary, followed by Data Engineers and Data Scientists.


Average Salary By Job and Location

We collected data from 3 different job locations. The average salary of each job position in these 3 locations is visualized below:

(Chart: average salary of each job position in each location)

From the above graph, we can conclude the following:

  1. In New York, business analysts have the lowest average salary of $100753.35, and Python developers have the highest average salary of $157071.6.

  2. In California, business analysts have the lowest average salary of $104824.98, and machine learning engineers have the highest average salary of $178835.62.

  3. In Los Angeles, business analysts have the lowest average salary of $96380.44, and machine learning engineers have the highest average salary of $169312.84.

  4. California has the highest average salary for each job position compared to New York and Los Angeles.


Unlock the full potential of web data for your brand

Gaining in-depth insights into job market data through web scraping on platforms like Indeed can be a game-changer for job seekers and employers alike. The ability to extract and analyze data from job postings, salaries, and skill requirements empowers individuals and businesses to make informed decisions in an ever-changing job landscape.


However, the process of web scraping can be intricate and time-consuming, requiring expertise and specialized tools. This is where Datahut, a leading provider of web scraping services, comes into play.


Datahut offers a seamless solution for companies seeking to harness the power of web scraping. By leveraging Datahut's web scraping services, companies can unlock the full potential of platforms like Indeed. Whether it's gathering market intelligence, tracking job trends, or obtaining competitive insights, Datahut's expertise ensures accurate and reliable data extraction, enabling businesses to stay ahead of the curve.


Don't miss out on the advantages that data-driven decision-making can bring to your organization. Take action today by partnering with Datahut and experience the transformative power of web scraping for your brand.


Visit Datahut to learn more

