Predicting Expected Salary of TECH Jobs

Opeyemi Seriki
6 min read · Oct 8, 2022


Tech jobs, like jobs in every other field, have grown significantly over the years along with their applications. What does it take to get into tech, and what substantial advantage can you expect in this field?

Technically, a tech career is mostly about acquiring skills: proficiency in solving problems, hard work, belonging to a community, and the list goes on.

The team worked on predicting the expected salary for tech jobs such as data analyst and related roles.

The project had four phases; the challenges faced in each, and how they were resolved, are described below.

  1. Data scraping and cleaning
  2. Exploratory Descriptive Analysis
  3. Model Training
  4. Deployment using Streamlit

Data Scraping and Cleaning:

The team considered LinkedIn, Indeed, and Glassdoor for data collection. However, because salaries were missing from the structured data on two of the sites, we decided to work with Indeed.

Using Beautiful Soup, we built a script for the scraping operation.

The first scraping phase yielded 8,700 records.

This was too little data for our model, so we decided to scrape again. However, we hit a blocker: Indeed flagged script-based scraping of their site.

To resolve the blocker, we switched to Selenium and Beautiful Soup on a virtual machine for automated data collection.

Using Requests & Beautiful Soup:

We built a web scraping script in Google Colab that made multiple requests to pages on the Indeed portal, based on the selected country and the number of pages to be scraped.
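A minimal sketch of such a script might look like the following. The URL pattern and CSS selectors are illustrative assumptions about Indeed's markup, not the team's actual code:

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}  # plain script requests get blocked

def parse_job_cards(html):
    """Extract title/salary pairs from one results page.
    The CSS selectors here are assumptions about Indeed's markup."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.select("div.job_seen_beacon"):
        title = card.select_one("h2.jobTitle")
        salary = card.select_one("div.salary-snippet-container")
        jobs.append({
            "title": title.get_text(strip=True) if title else None,
            "salary": salary.get_text(strip=True) if salary else None,
        })
    return jobs

def scrape(query, domain="www.indeed.com", pages=1):
    """Request `pages` result pages (10 listings per page) and parse each."""
    results = []
    for page in range(pages):
        url = f"https://{domain}/jobs?q={query}&start={page * 10}"
        resp = requests.get(url, headers=HEADERS, timeout=10)
        results.extend(parse_job_cards(resp.text))
    return results
```

Splitting fetching from parsing keeps the parsing logic testable without network access.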

We got over 18,000 records after this operation.

The script also performed some basic data cleaning, keeping only the relevant fields.

Selenium Challenges:

Selenium could crawl the page, but it struggled with dynamic elements that take time to load and depend on interactions from the person browsing the page.

Data Cleaning:

Cleaning of the title and eligibility fields was done by carefully selecting the most commonly used titles and eligibility criteria across the globe. The cleaning pipeline extracted the following from the raw job postings collected from Indeed: job title, salary (upper and lower range), country, state, years of experience, job position, age, skillset, educational qualification, and salary pay frequency.

The extracted features were cleaned further to retain a set of specific roles, which were later studied in the EDA stage.

Skillset cleaning initially used an NLP algorithm, but it didn't work in our favour, so we resorted to matching the text against our own corpus of skillsets.
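As an illustration of the corpus-matching approach (the skill list below is a small made-up sample, not the team's actual corpus):

```python
import re

# Illustrative corpus; the team's actual skill list was larger.
SKILL_CORPUS = ["python", "sql", "excel", "tableau", "machine learning", "aws"]

def extract_skills(description, corpus=SKILL_CORPUS):
    """Return the corpus skills mentioned in a job description,
    matched as whole words/phrases, case-insensitively."""
    text = description.lower()
    found = []
    for skill in corpus:
        if re.search(r"\b" + re.escape(skill) + r"\b", text):
            found.append(skill)
    return found
```

Whole-word matching avoids false hits such as finding "r" inside "developer", which is one reason a hand-curated corpus can beat a generic NLP extractor here.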

The upper and lower salaries of all rows were converted to US dollars.

Data cleaning reduced the records gathered from Indeed to 1,800 rows.

Exploratory Descriptive Analysis:

We used 15,036 records for EDA.

The chart above shows the missing values in the dataset: solid purple means a column has no missing values, while yellow lines indicate missing values in that column.
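A missingness matrix like this can be drawn in a few lines of matplotlib; this is a generic sketch, not the team's plotting code. With the viridis colormap, present values render purple and missing values yellow, matching the description above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_missing(df, path="missing.png"):
    """Save a matrix plot of missingness: purple = present, yellow = missing."""
    fig, ax = plt.subplots(figsize=(8, 4))
    # df.isna() is a boolean matrix; viridis maps False->purple, True->yellow.
    ax.imshow(df.isna().to_numpy(), aspect="auto",
              cmap="viridis", interpolation="none")
    ax.set_xticks(range(len(df.columns)))
    ax.set_xticklabels(df.columns, rotation=45, ha="right")
    ax.set_ylabel("row")
    fig.tight_layout()
    fig.savefig(path)
    return fig
```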

The chart above shows the counts of the categorical features in the data, highlighting the largest categories. For example, Data Scientist is the most common of the titles scraped, and most of the data collected is from the UK.

The chart above shows the categorical values in relation to contract type, which has two categories: Full Time and Contract. From this plot we see that India has the most Full Time records, and Senior Level has the most Full Time records.

The chart above shows the categorical values in relation to position, which has three categories: Senior, Mid, and Entry Level. From this plot we see that Data Scientist has the most Senior Level records, and the UK has the most Senior Level records.

The figure above shows the distribution of the lower salary.

The figure above shows the distribution of the upper salary.

Model Training

Three linear regression algorithms were used to develop models: traditional Linear Regression, Lasso regression, and Ridge regression, all from scikit-learn.

We intended to develop a model that makes predictions for Nigeria, the UK, the US, and India for the following roles:

  1. Business Analyst
  2. Data Analyst
  3. Data Scientist
  4. Machine Learning Engineer
  5. Web Developer

A basic model was developed as a test run, but its performance was poor due to the lack of data.

The next point of action was to gather more data: more data, better predictions.

Even after another round of data gathering, the records for Nigeria were insufficient, so we had to drop Nigeria.

The features used in developing the model are:

1. Job title

2. Years of experience

3. Level or position

4. Eligibility or Educational Qualification

5. Country

6. Job type (Full Time or Contract)

Missing values for position were inferred from the years of experience required for the job.

Missing values for years of experience were filled using scikit-learn's IterativeImputer.

Records missing both target variables (upper and lower salary) were dropped.
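The three imputation steps above could be sketched as follows. The column names and the seniority thresholds are assumptions for illustration, not the team's exact schema:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def years_to_position(years):
    """Infer seniority from required experience; thresholds are assumptions."""
    if pd.isna(years):
        return np.nan
    if years < 2:
        return "Entry Level"
    if years < 5:
        return "Mid Level"
    return "Senior Level"

def impute(df):
    # 1. Drop rows missing both salary targets.
    df = df.dropna(subset=["lower_salary", "upper_salary"], how="all").copy()
    # 2. Fill missing years of experience from the other numeric columns.
    num_cols = ["years_experience", "lower_salary", "upper_salary"]
    df[num_cols] = IterativeImputer(random_state=0).fit_transform(df[num_cols])
    # 3. Fill missing position from the (now complete) years of experience.
    df["position"] = df.apply(
        lambda r: r["position"] if pd.notna(r["position"])
        else years_to_position(r["years_experience"]), axis=1)
    return df
```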

13,955 records were used to train all three models.

We had initially included skillsets as predictor variables, but they were later removed because they had no noticeable effect on the predictions.

The best-performing model used Lasso regression, with an R² score of 0.61 on the train set and 0.59 on the test set.
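A training setup along these lines, using the six features listed above, might look like this. The column names and the encoding pipeline are assumptions about the team's code, not a reproduction of it:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed schema mirroring the six features listed above.
categorical = ["job_title", "position", "qualification", "country", "job_type"]
numeric = ["years_experience"]

model = Pipeline([
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough")),  # years_experience passes through
    ("lasso", Lasso(alpha=1.0)),
])

# Typical usage on a DataFrame X with the columns above and salary target y:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# model.fit(X_train, y_train)
# model.score(X_train, y_train), model.score(X_test, y_test)  # R² scores
```

`handle_unknown="ignore"` lets the deployed model accept a category it never saw in training instead of raising an error.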

Streamlit:

Streamlit development was tied to the modeling stage, as the trained model was used within Streamlit for prediction.

Streamlit is easy to configure and deploy, which streamlined our effort of running multiple iterations of training models and testing them against user-provided values.

Since the prediction needs to report the expected salary in the currency of the selected country, an exchange rate API was used for that task. Both forex-python and Exchange Rate API were tried.

Challenges on currency conversion:

Pros of forex-python:

Easy to install and code.

Quick to deploy

Limitations of forex-python:

It doesn't provide exchange rates for all the countries we are predicting for.

The rates provided are not reliable.

The package has a tendency to fail, as reported by many users.

Exchange Rate API:

Pros:

Easy to implement

Works with the requests module.

Faster response time.

Lots of countries are covered.
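A minimal sketch of requests-based conversion is below. The endpoint shown is one free exchange-rate API; the exact service the team used may differ, so treat the URL and response shape as assumptions:

```python
import requests

def fetch_rates(base="USD"):
    """Fetch the latest rates for a base currency.
    The endpoint and its JSON shape are assumptions about the service used."""
    resp = requests.get(f"https://open.er-api.com/v6/latest/{base}", timeout=10)
    resp.raise_for_status()
    return resp.json()["rates"]  # e.g. {"GBP": 0.8, "INR": 83.0, ...}

def to_local(amount_usd, currency, rates):
    """Convert a USD prediction into the selected country's currency."""
    return round(amount_usd * rates[currency], 2)
```

Keeping `to_local` separate from the network call means the rates can be fetched once per session and reused for every prediction.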

Relevant links

You can check out the code: Link to Github Repo

You can test the app: Link to the StreamLit App

LinkedIn Links of the Contributors

Opeyemi Seriki

Ovie Iboyitie

Abhishek John Masih

popoola kayode

Victor Oguche
