1 Introduction

Bias and discrimination are conventionally understood as conferring preference on one person or treating someone differently from another. In today’s society, discrimination and prejudice cut across a wide range of human endeavors and affect countless people, especially those designated as underrepresented or minority groups. Bias and discrimination serve as a thumbprint of socially constructed stereotypes, as they are often a product of extensive cultural and societal learning (Gravett 2017). In most cases, people learn cultural attitudes about gender and race from an early age.

Bias in recruiting becomes noticeable when some applicants gain an unfair advantage because of protected attributes, e.g., physical appearance, gender, and ethnicity (Böhm et al. 2020; Erkmen et al. 2021). Studies have shown that women and People of Colour (PoC) have lower chances of getting jobs than men and Caucasians with similar qualifications and experience. Another study found that replacing African-American-sounding names with fictitious White-sounding names significantly increased the chances of being invited for job interviews and of receiving job offers (Bertrand and Mullainathan 2004). Empirical studies, including Böhm et al. (2020) and Österlund (2020), have shown that words such as challenging or dominant in job advertisements tend to appeal to male applicants. Beyond these, studies have attributed a wide range of other words to unconsciously enticing candidates of one gender at the expense of the other, thereby reinforcing bias and discrimination through job descriptions. This discourages well-qualified candidates from certain groups from applying for jobs.

Previous works have analyzed and categorized bias as it pertains to the recruiting industry (Derous and Decoster 2017; Dnvsls and Kiran 2016; Gaucher et al. 2011). The work presented in Amin et al. (2020) shows the occurrence of bias in the job market and provides insight into how the number of job advertisements has evolved relative to the occurrence of bias. Similarly, studies have shown that minority candidates may feel discriminated against based on the phrasing of a job advert (Gaucher et al. 2011; Tang et al. 2017). Some words and phrases in job adverts have been found to appeal more to male job seekers than to female ones (Koch et al. 2015). Furthermore, certain racially sensitive words can give minority or immigrant applicants the impression that they do not fit a company’s culture. The authors in Böhm et al. (2020) show the appearance of gender bias in IT job descriptions and propose a tool to determine whether a job description is male- or female-oriented based on its writing style. The essence of these studies is to establish the presence of bias that can lead to discrimination against job applicants from minority groups.

The existing works are limited to analyzing, categorizing and highlighting the occurrence of bias in the recruitment process. In this paper, we go beyond this and develop machine learning models for identifying and classifying biased and discriminatory language in job descriptions. Five major categories of biased and discriminatory language were identified after a thorough examination of existing works on behavioral science with a focus on bias and discrimination in recruitment. The major research questions addressed in this paper are as follows:

RQ1—In what ways are bias and discrimination currently present in the recruitment process, and how do they manifest in the context of job descriptions?

RQ2—Which state-of-the-art technologies can be employed to achieve optimal results in automatically identifying and classifying biased and discriminatory language in job descriptions?

To answer RQ1, we examined the literature on behavioral science, in particular on bias and discrimination in the recruiting industry (Sect. 2.1). We also developed a list of 524 unique biased and discriminatory terms (see Sect. 3.3) divided into five categories: masculine-coded, feminine-coded, exclusive, LGBTQ-coded, and demographic and racial language.

To answer RQ2, the list of 524 unique biased and discriminatory terms was used to annotate a corpus of job descriptions from the publicly available Employment Scam Aegean Dataset, EMSCAD (Vidros et al. 2017), into five categories of biased and discriminatory language. A gazetteer-based approach was used to semi-automatically generate an annotated corpus by tagging the biased language terms in the job advertisements. We utilized a combination of linguistic features and state-of-the-art word embedding representations as input features for training machine learning classifiers on the annotated corpus.

The hiring process can be broken down into three phases, namely the attraction, selection and retention phases. In the attraction phase, an employer aims to invite applicants by describing the ideal candidate and the job role in a job advertisement. The selection phase involves assessing the candidates who applied for a job, e.g., reviewing submitted CVs and matching and shortlisting candidates for interview. While attention is usually focused on the selection phase, and machine learning systems have been developed to automatically review candidates’ profiles and select suitable candidates, the attraction phase has been overlooked. The uniqueness of our approach stems from the fact that our machine learning-based system directly addresses the attraction phase by identifying and classifying biased and discriminatory language in job descriptions. The output of our system would empower recruiters and Human Resources (HR) managers to flag biased and discriminatory terms and replace them with more inclusive language. Currently available systems focus exclusively on the selection phase of the hiring process and therefore lack the capabilities to eliminate bias at the attraction phase (Kodiyan 2019; Bendick and Nunes 2012). This obvious gap motivates our work in this direction. To the best of our knowledge, this is the first paper to propose the use of machine learning (ML) and natural language processing (NLP) to tackle bias and discrimination at the attraction phase of hiring.

The rest of the paper is organized as follows. Section 2 presents the literature review on bias and discrimination in recruitment, measures taken to prevent bias and discrimination and the use of natural language processing (NLP) in the selection phase of hiring. Section 3 presents the methodology. The results and analysis are discussed in Sect. 4. Section 5 presents the conclusion and future work.

2 Literature review

2.1 Bias and discrimination in recruitment

Hiring is usually not a single decision but a chain of events that results in a job offer for an applicant. The first step is the talent attraction or sourcing phase, where an employer hopes to generate a strong set of applicants. Typically, employers disseminate available job positions, describing the role as well as the ideal candidate profile. The second step is the selection or screening phase, where an employer or recruiter, independently or with the aid of AI algorithms, assesses and ranks the applicants in order of their employability. The outcomes of both steps might be influenced by bias. In the first step, a candidate from a particular group (e.g., a woman) may not feel motivated to apply for the advertised job due to the phrasing of the job description (Österlund 2020). In the second step, a human recruiter may unconsciously have an ingroup preference for candidates with a similar ethnicity or appearance (Bertrand and Mullainathan 2004), and an algorithm may have encoded societal stereotypes found in the data it was trained on (Raub 2018). The work proposed in this paper seeks to eliminate bias in the attraction phase of hiring.

Bias and discrimination exist in different forms in hiring. Bias is an implicit inclination or prejudice for or against one person or group. According to Österlund (2020), discrimination can be either implicit (unconscious) or explicit (conscious), and can occur on the basis of gender, ethnicity, sexual orientation, culture, religion, age, etc. The driving force behind these influenced choices is the disposition that a person develops over their lifetime, shaped by personal experiences and, for example, media influence (Oates 2018). Discrimination is recognized when two people in the same situation are treated differently and preferential behavior occurs; for example, a man and a woman both apply for the same job and the man is offered the position even though the woman is better qualified. Ethnic, gender and age discrimination are the three most common forms of discrimination an applicant faces during the application process. In addition to these three forms, discrimination based on pregnancy, political views, religious beliefs, disability or other impairments, and illness is also commonly mentioned (Österlund 2020).

Unlike unconscious bias, conscious bias refers to explicitly held perceptions of individuals or groups in society. The author in Oates (2018) opines that conscious bias may lead to disparate treatment of coworkers and can derail the search process for bringing new people into an organization. An example of this is the preference to work with men rather than women. This may lead to the exclusion of specific people when it comes to opportunities within the labor market, ultimately leading to discrimination (Oates 2018). Once a person has unconsciously internalized a bias, it is difficult to remove it from their thoughts and way of thinking. It is therefore important to prevent bias before it occurs and to take precautions that stimulate awareness (Österlund 2020).

2.2 Measures taken to prevent bias and discrimination

Several measures have been proposed to prevent bias and discrimination in the recruitment process (Österlund 2020): anonymizing the application process (so-called blind hiring), whereby personal information is removed from the resume so that bias can hardly take place; applying proper recruitment management and interviewing techniques, such as conducting only competence-based structured interviews; and facilitating unconscious-bias training, whereby people are made aware of how unconscious bias can be prevented.

The work in Böhm et al. (2020) presents an approach for measuring gender bias in IT job postings. The approach involves a prototype tool, built around a trained machine learning classifier, that identifies gender bias in a job post. The authors also developed a user interface in which a job description can be reviewed for possible bias and discrimination. In this way, recruiters are informed when gender bias unintentionally occurs and can correct it by applying the tool’s suggestions. The tool checks the job title and description for gender-biased language and, based on these occurrences, calculates a gender-neutrality score that indicates how gender-neutral the post is. The study also proposes a repository of keywords that encourage or discourage women during the application process, and the tool suggests replacement language for terms classified as discriminatory, helping recruiters de-bias job posts. An early evaluation on three batches of job posts from different job portals showed promising results; however, the method for computing the gender-neutrality score requires substantial improvement.

The research work in Derous and Ryan (2018) presents a model of three steps that explain the underlying causes of biased resume screening: applicant information, impression formation and screening outcomes. The first step, applicant information, covers an applicant’s qualifications but also the non-job-related information that can be inferred from those qualifications and characteristics. The second step, impression formation, focuses on how recruiters process the retrieved information. Although the processes in this step are largely automatic or unconscious, a high degree of conscious deliberation can still be present in the decision-making, for example when a person’s qualifications and characteristics are assessed from the collected data. The third step concerns the outcomes of the resume screening process, where perceptions of similarity can influence how applicants are attracted and retained. The three-step model shows why resume screening is vulnerable to bias, though not why discrimination occurs in ethical terms; nonetheless, the authors identified a number of factors that contribute to biased resume screening.

For example, they argue that a lack of extensive personal information about an applicant can lead to biased decision-making, as people are then more likely to be pigeonholed based on stereotypes. Conversely, they also argue that including non-work-related information about an applicant can lead to biased decision-making. Their research therefore recommends that candidates provide sufficient information about themselves in the application while limiting non-work-related details. They also present a number of interventions to avert biased resume screening, including anonymizing resumes, standardizing processes, training recruiters more intensively and holding them accountable for their hiring decisions.

2.3 Natural language processing (NLP) in the selection phase of hiring

The application of NLP within the recruitment industry is not new, and innovative tools continue to emerge. NLP has been used to analyze interviews to examine the relationship between verbal content and perceived hireability ratings (Muralidhar et al. 2018). Tools based on NLP have also been developed for the automated extraction of relevant information, such as skills, work experience and interests, from resumes (Sanyal et al. 2017; Wings et al. 2021).

The authors in Guo et al. (2016) developed a tool named RésuMatcher, which utilizes text similarity and machine learning models to find the optimal match between a resume and a job post. Such tools are used in the recruitment industry to select ideal matching candidates for a particular job post.

A similar work is presented in Maheshwary and Misra (2018), where the researchers designed a tool to identify the optimal candidate for a job description using a deep Siamese network (Bromley et al. 1993; Chopra et al. 2005), a neural network (NN) architecture that effectively captures high-level text semantics (Adebayo et al. 2017). In their experiments, the NN models significantly outperformed linear algorithms built on commonly used text representations, e.g., word n-grams, TF-IDF, Bag-of-Words, Bag-of-Means (the average Word2Vec embedding of the training data) and Doc2Vec (Le and Mikolov 2014), as well as a convolutional neural network.

Researchers in Dnvsls and Kiran (2016) proposed a three-phase algorithm to optimally match candidates’ profiles to job ads. Their system is based on the Hadoop framework: in the data gathering phase, resumes are retrieved and pushed to the Hadoop distributed file system, while the data processing and attribute tagging phases use NLP techniques for named-field extraction and named entity recognition of names, emails, phone numbers, and so on. The authors in Amin et al. (2020) proposed a web application to predict the best-fit resumes for job descriptions posted by recruiters, with the main goal of lowering recruiters’ workload by sparing them from going through every applicant’s details. The comparison between a given job description and a candidate’s resume is based on TF-IDF cosine similarity scores, as sketched below. An obvious limitation is that, when calculating an applicant’s relevant work experience, the years in which the applicant had been studying were sometimes counted.
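For illustration, the following minimal sketch shows how such TF-IDF cosine similarity matching can be computed with scikit-learn; the example texts are invented and are not data from Amin et al. (2020).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative documents; a real system would load parsed resumes and job posts.
job_description = "Seeking a data engineer with Python, SQL and cloud experience."
resumes = [
    "Data engineer, 5 years of Python and SQL, AWS pipelines.",
    "Marketing specialist with social media and branding experience.",
]

# Fit TF-IDF on the combined corpus so both sides share one vocabulary.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([job_description] + resumes)

# Cosine similarity of each resume against the job description (row 0).
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
for idx, score in sorted(enumerate(scores), key=lambda x: x[1], reverse=True):
    print(f"resume {idx}: similarity = {score:.3f}")
```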

3 Methodology

In this section, we present the methodology used to develop the NLP pipeline to identify biased and discriminatory language in job descriptions.

3.1 Data collection

The publicly available Employment Scam Aegean Dataset, EMSCAD, was used for this study (Vidros et al. 2017). It was chosen because it contains real-life job descriptions and has been widely used for research purposes. The dataset consists of 17,014 legitimate and 866 fraudulent job advertisements. We decided to use only legitimate job advertisements. The personal information present in the dataset was either anonymized or removed. Due to the limited computational resources available for training, we utilized 3000 job descriptions for experiments.
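A minimal loading sketch is shown below; the file name and the 'fraudulent' and 'description' column names are assumptions based on the published dataset description and may need to be adjusted to the actual export.

```python
import pandas as pd

# Assumed CSV export of EMSCAD with a fraudulence flag and a description column.
df = pd.read_csv("emscad.csv")

# Keep only the legitimate job advertisements.
legit = df[df["fraudulent"].isin([0, "f", "false", False])]

# Sample 3,000 job descriptions for the experiments (fixed seed for repeatability).
sample = legit.sample(n=3000, random_state=42)
descriptions = sample["description"].dropna().tolist()
print(f"{len(descriptions)} job descriptions selected")
```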

3.2 Data pre-processing

For the purposes of this study, the focus was solely on the job description of a job advert. Therefore, we only utilized the job description column of the EMSCAD dataset. We used regular expressions to remove HTML code, special characters and empty lines from the job descriptions. This was followed by sentence tokenization. Sentences containing fewer than two words were removed. Finally, the sentences were tokenized into words.
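A minimal sketch of this pre-processing pipeline is shown below; NLTK is assumed for tokenization here, and the regular expressions are illustrative rather than the exact patterns used.

```python
import re
import nltk

nltk.download("punkt", quiet=True)  # sentence and word tokenizer models

def preprocess(description: str) -> list[list[str]]:
    """Clean one job description and return it as tokenized sentences."""
    text = re.sub(r"<[^>]+>", " ", description)           # strip HTML tags
    text = re.sub(r"[^A-Za-z0-9.,;:!?'\- ]+", " ", text)  # drop special characters
    text = re.sub(r"\s+", " ", text).strip()              # collapse whitespace and empty lines

    sentences = nltk.sent_tokenize(text)
    tokenized = [nltk.word_tokenize(s) for s in sentences]
    # Discard sentences with fewer than two words, as in our pipeline.
    return [tokens for tokens in tokenized if len(tokens) >= 2]

print(preprocess("<p>We are hiring!</p><p>Apply now if you are a rockstar developer.</p>"))
```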

3.3 Data annotation

Due to the unavailability of annotated open-source data on bias, developing a methodological framework for coding biased terms is non-trivial. To navigate this issue, we carried out a data-driven literature review to gain insights into the common categories of bias that are most evident in the attraction phase of recruiting. We also explored different language style guides to understand the linguistic evolution of biased and discriminatory language in the context of hiring. Through this process, we gained the requisite domain knowledge to categorize the biased phrasings into five major categories according to the linguistic map in our study: masculine-coded, feminine-coded, exclusive, LGBTQ-coded, and demographic and racial language.

The dataset was annotated using 524 unique biased and discriminatory terms divided into the five categories as mentioned before. Given the large number of unique words that needed to be annotated, and the labor-intensive nature of manual annotation, a semi-automatic method for annotation was chosen. A gazetteer-based approach was used to semi-automatically generate an annotated corpus by tagging the biased language terms in the job advertisements. A thorough manual inspection was done to ensure that the tagged annotations were correct.
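The tagging step of the gazetteer-based approach can be sketched as follows; the term lists here are short illustrative placeholders, not the 524-term lexicon described above.

```python
# Short illustrative term lists; the full lexicon contains 524 terms across five categories.
GAZETTEER = {
    "masculine_coded": {"dominant", "competitive", "ninja"},
    "feminine_coded": {"nurturing", "supportive"},
    "exclusive": {"young", "energetic"},
    "lgbtq_coded": {"he", "she"},
    "demographic_racial": {"native"},
}

def tag_tokens(tokens):
    """Assign a bias-category label to each token, or 'O' if it is neutral."""
    tagged = []
    for token in tokens:
        label = "O"
        for category, terms in GAZETTEER.items():
            if token.lower() in terms:
                label = category
                break
        tagged.append((token, label))
    return tagged

print(tag_tokens(["We", "want", "a", "dominant", ",", "young", "ninja", "developer"]))
```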

3.4 Model design

3.4.1 Feature engineering

3.4.1.1 Linguistic features

Several linguistic features were used for each token. Each feature represents a characteristic property of the word. Table 1 presents an explanation for each linguistic feature.

Table 1 Linguistic features
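For illustration, the sketch below extracts token-level linguistic features of the kind listed in Table 1; spaCy's en_core_web_sm model and the specific feature names are assumptions for this example, not necessarily the toolkit and feature set used in our pipeline.

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def linguistic_features(sentence: str) -> list[dict]:
    """Return an illustrative feature dictionary for each token."""
    features = []
    for token in nlp(sentence):
        features.append({
            "text": token.text,
            "lower": token.text.lower(),
            "lemma": token.lemma_,
            "pos": token.pos_,           # coarse part-of-speech tag
            "is_title": token.is_title,  # capitalized word
            "is_digit": token.is_digit,
            "suffix3": token.text[-3:],  # last three characters
        })
    return features

print(linguistic_features("We need an aggressive sales ninja."))
```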
3.4.1.2 Semantic features

We utilized various state-of-the-art pre-trained word embeddings as textual features for the machine learning classifiers. The word embeddings used are: Word2Vec (Mikolov et al. 2013), BERT (Devlin et al. 2019), ELMo (Peters et al. 2018), GloVe (Pennington et al. 2014), Flair (Akbik et al. 2018) and FastText (Bojanowski et al. 2017). Pre-trained word embeddings were used because embeddings trained on the EMSCAD dataset did not demonstrate sufficient semantic quality, owing to the small size of the dataset. Table 2 shows the pre-trained models used for each word embedding.

Table 2 Word embeddings characteristics

For each token, the word embedding vectors were extracted from the corresponding word embedding model using the Flair library.
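As a minimal sketch, the snippet below shows how per-token vectors can be obtained with the Flair library; the bert-base-uncased checkpoint is an illustrative assumption, as the exact pre-trained models are those listed in Table 2.

```python
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# Contextual BERT embeddings; the other models in Table 2 can be loaded analogously,
# e.g. WordEmbeddings("glove"), WordEmbeddings("en") for fastText, or ELMoEmbeddings().
embedding = TransformerWordEmbeddings("bert-base-uncased")

sentence = Sentence("We are looking for a dominant team leader.")
embedding.embed(sentence)

for token in sentence:
    vector = token.embedding.cpu().numpy()  # per-token embedding vector
    print(token.text, vector.shape)
```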

3.4.1.3 Feature selection

The aforementioned linguistic features were combined with one of the six semantic features (word embedding) to produce a unique feature set. As a result, six unique feature sets were produced as input to the machine learning classifiers.
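The combination of the two feature types can be sketched as follows; the one-hot encoding of the linguistic features via DictVectorizer and the toy inputs are illustrative, not necessarily the exact scheme used in our experiments.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Toy inputs: one embedding vector and one linguistic feature dict per token.
embedding_matrix = np.random.rand(3, 768)  # e.g. BERT vectors for 3 tokens
linguistic_dicts = [
    {"pos": "PRON", "is_title": True, "suffix3": "We"},
    {"pos": "VERB", "is_title": False, "suffix3": "eed"},
    {"pos": "NOUN", "is_title": False, "suffix3": "nja"},
]

# One-hot encode the categorical features, then append them to the embeddings.
encoder = DictVectorizer(sparse=False)
ling_matrix = encoder.fit_transform(linguistic_dicts)
X = np.hstack([embedding_matrix, ling_matrix])  # one combined feature row per token
print(X.shape)
```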

3.4.2 Machine learning classifiers

The machine learning classifiers were trained using the six unique feature sets on the training set of annotated job descriptions. The following classifiers were used:

  • Support vector machine (SVM)

  • Random Forest (RF)

  • Logistic regression (LR)

  • Decision tree (DT)

  • Naive Bayes (NB)

  • Multi-layer perceptron classifier (MLP)

For the baseline classifier, Scikit-learn’s Dummy classifier was utilized. Parameter optimization using grid search (scikit-learn’s GridSearchCV) made it possible to find the optimal parameters for all the machine learning classifiers. For all the classifiers, the maximum number of iterations was left unbounded to ensure that the models were able to converge. All parameters, including the default parameters used for model training, are presented in Table 3.

Table 3 Parameter grid
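A minimal sketch of the grid search step is shown below; the parameter grids and the synthetic stand-in data are illustrative only, with the actual grids given in Table 3.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the combined token feature matrix and category labels.
rng = np.random.default_rng(42)
X_train, y_train = rng.random((300, 20)), rng.integers(0, 6, 300)

# Illustrative grids only; the grids actually used are listed in Table 3.
search_spaces = {
    "random_forest": (RandomForestClassifier(random_state=42),
                      {"n_estimators": [100, 300], "max_depth": [None, 20]}),
    "svm": (SVC(max_iter=-1),  # -1 imposes no iteration limit
            {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
}

best_models = {}
for name, (estimator, grid) in search_spaces.items():
    search = GridSearchCV(estimator, grid, scoring="f1_macro", cv=5, n_jobs=-1)
    search.fit(X_train, y_train)
    best_models[name] = search.best_estimator_
    print(name, search.best_params_, round(search.best_score_, 3))
```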

4 Results and analysis

In this section, we present the results of the various machine learning models on the EMSCAD dataset. The dataset was divided into an 80% training set and a 20% testing set. The evaluation metrics accuracy, precision, recall and F1-score were computed for each model. Figure 1 presents the evaluation metrics for the various classifiers with the different feature sets. The results indicate that the RF classifier with BERT word embeddings as the textual feature achieved the best performance. This illustrates that contextual word embedding representations such as BERT outperformed non-contextual word embeddings such as FastText and Word2Vec. We also observe that the tree-based classifiers (Random Forest and Decision Tree) performed better at classifying biased and discriminatory language than the remaining classifiers. Among the textual features, the BERT, FastText and ELMo word embedding representations in combination with the RF classifier had the best performance, followed by the FastText, ELMo and Flair word embeddings in combination with the DT classifier.
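For illustration, the sketch below reproduces the 80/20 split and metric computation with scikit-learn; the synthetic feature matrix and classifier settings are stand-ins rather than the tuned configuration reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the combined feature vectors (X) and bias-category labels (y).
rng = np.random.default_rng(0)
X, y = rng.random((500, 20)), rng.integers(0, 6, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="macro")
print(f"accuracy={accuracy_score(y_test, y_pred):.3f} "
      f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```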

Fig. 1 Machine learning models performance metrics

We further evaluate the various machine learning classifiers with the different word embedding representations as features using tenfold cross-validation. Figure 2 presents the macro-averages of the precision, recall and F1-score over tenfold cross-validation. The results of the tenfold cross-validation indicate that the RF classifier with FastText word embeddings had the best performance. Figures 3, 4, 5 and 6 present the individual results for accuracy, precision, recall and F1-score for the various models.
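The tenfold cross-validation can be reproduced with scikit-learn's cross_validate; the snippet below is a sketch using synthetic stand-in data in place of the annotated token features and labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data; in practice X and y are the annotated token features and labels.
rng = np.random.default_rng(0)
X, y = rng.random((500, 20)), rng.integers(0, 6, 500)

scoring = ["precision_macro", "recall_macro", "f1_macro"]
cv_results = cross_validate(RandomForestClassifier(random_state=42), X, y,
                            cv=10, scoring=scoring, n_jobs=-1)

for metric in scoring:
    scores = cv_results[f"test_{metric}"]
    print(f"{metric}: mean={scores.mean():.3f} std={scores.std():.3f}")
```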

Fig. 2 Cross-validated machine learning performance metrics

Fig. 3 Machine learning models—accuracy

Fig. 4 Machine learning models—precision

Fig. 5 Machine learning models—recall

Fig. 6 Machine learning models—F1

Figures 7 and 8 present the confusion matrices of the two best-performing models: (1) the RF classifier with FastText word embeddings and (2) the RF classifier with BERT word embeddings. The results in Figs. 7 and 8 indicate that all five classes of biased and discriminatory language were distinguishable from each other.
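A confusion matrix such as those in Figs. 7 and 8 can be produced as follows; the labels here are toy stand-ins for the held-out gold and predicted bias categories.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Toy stand-in labels; in practice these come from the trained classifier's test predictions.
y_true = ["masculine", "feminine", "exclusive", "masculine", "lgbtq", "demographic"]
y_pred = ["masculine", "feminine", "masculine", "masculine", "lgbtq", "demographic"]

ConfusionMatrixDisplay.from_predictions(y_true, y_pred, xticks_rotation=45)
plt.tight_layout()
plt.savefig("confusion_matrix.png")
```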

Fig. 7 Confusion matrix Random Forest—FastText

Fig. 8 Confusion matrix Random Forest—BERT

In our primary experiment, we observed a linear improvement in the performance of a sample of our models as the training size increased. We wanted to determine whether this behavior was specific to those particular models or held statistically across all our implemented models. To validate the improvement, we added 3,000 additional job descriptions and ran a new experiment with the lightweight classifiers (DT, LR and NB).

The results obtained from the new experiment are shown in Fig. 9 for the regular models (80% training set and 20% test set), and Fig. 10 for the tenfold cross-validated models. We see that the DT classifier with BERT word embeddings produced the best performance: 0.98977, 0.99587 and 0.99277 for precision, recall and F1-score, respectively. Compared to the best-performing model in our first experiment, i.e., the BERT–RF combination shown in Fig. 1, whose scores were 0.98557, 0.98862 and 0.98544 for precision, recall and F1-score, respectively, this represents an improvement.

Fig. 9 Extended machine learning models performance metrics

Fig. 10 Extended cross-validated machine learning performance metrics

This analysis reinforces our belief that even the strong performance observed across the board can be improved further. However, it is not yet clear whether the samples used for evaluation might have been simplistic, in the sense that they represent trivial cases. To the best of our knowledge, we avoided cherry-picking by selecting the evaluation set through extensive random sampling. Still, given the small size of the evaluation set relative to our full data, we cannot confidently rule out that this had an impact. We leave this for future work, where we plan to run a more comprehensive experiment on the entire dataset, including an extensive ablation study with error analysis of the results.

5 Conclusion and future research

This paper presented a machine learning approach to identify five major categories of biased and discriminatory language in job advertisements. We prepared a list of unique biased and discriminatory terms after examining the behavioral science literature related to bias in recruitment. This list was used to semi-automatically generate an annotated corpus by tagging the biased language terms (using a gazetteer-based approach) in the job advertisements of the publicly available Employment Scam Aegean Dataset, EMSCAD. This annotated corpus was used to train state-of-the-art machine learning classifiers to identify five different categories of biased and discriminatory language. We utilized a combination of linguistic features and state-of-the-art word embedding representations as textual features to capture the natural language semantics of biased language. These features were fed into the machine learning classifiers. The results indicate that the Random Forest classifier with FastText word embeddings achieved the best performance with tenfold cross-validation. Overall, this work presents a major contribution to the attraction phase of hiring, empowering recruiters by identifying and classifying discriminatory language in job advertisements using a machine learning-based approach. The output of such a tool can be used to flag biased and discriminatory language and encourage recruiters to write more inclusive job advertisements.

Future research can benefit from incorporating additional categories of biased and discriminatory language. We also plan to train and evaluate the system on other documents such as resumes, academic papers, news articles and social media posts. Our long-term goal is to extend the current system so that it can identify and classify biased and discriminatory language in any natural language text. We also plan to develop metrics that produce a diversity and inclusivity score from the system’s output.