We will use a resume database to develop a content-based recommender system with NLP techniques and tools.
In particular, a content-based recommender system tries to predict what a user will like from the items' features. In our case, the items' features are the descriptions in the resumes, and the recommender takes as input a job post description written in natural language. The idea is to recreate a recruiter's search and recommend different profiles based on the information in their resumes.
Disclaimer: this project is an adaptation of part of my thesis, presented for the MSc in Business Analytics at Imperial College London in August 2022.
We will work with a database of more than 84 thousand resumes written in Spanish. The resumes come from a platform with both closed and open-ended fields. We will use the open-ended information written in natural language, such as education and experience, to build our recommender system.
We will apply the standard pre-processing steps, such as stop-word removal, punctuation removal, lowercasing, tokenizing and stemming. Then we will try three different approaches to build the recommender: TF-IDF vectorization, Count vectorization (bag of words), and K-Nearest Neighbors (KNN) over the TF-IDF vectors.
For the similarity measurement, the first two methods use cosine similarity, and the last one Euclidean distance (a small sketch of both measures follows).
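As a quick illustration of the two measures (a minimal sketch on toy vectors, not part of the pipeline; the numbers are made up):
# toy illustration of cosine similarity vs Euclidean distance
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
a = np.array([[1, 2, 0, 1]])   # e.g. term counts of a job post
b = np.array([[2, 4, 0, 2]])   # same direction as a, twice the length
c = np.array([[0, 1, 3, 0]])   # different term profile
print(cosine_similarity(a, b))    # ~1.0: orientation is identical, length is ignored
print(cosine_similarity(a, c))    # ~0.26: different term profile, lower similarity
print(euclidean_distances(a, b))  # ~2.45: magnitude matters for this measure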
Finally, since this is an unsupervised learning problem we do not have a "ground truth", i.e., there is no objective reference against which to compare our outputs, so we will evaluate the different methods by manual inspection.
# libraries
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
# load dataframes
df_studies = pd.read_csv('qtyadatos_cvestudio.csv')
df_experience = pd.read_csv('qtyadatos_cvexperiencia.csv')
df_experience.describe()
## we will work with the experience dataframe. 'curriculum_id' is the column that identifies user id
df_experience3 = df_experience[['curriculum_id','puesto','responsabilidades']].copy()
df_experience3.head()
### we group the table so we have one row per user
df_experience3['g'] = df_experience3.groupby('curriculum_id').cumcount().add(1)
df_experience3 = (df_experience3.set_index(['curriculum_id','g'])
.unstack(fill_value=-1)
.sort_index(level=1, axis=1))
df_experience3.columns = df_experience3.columns.map(lambda x: f'{x[0]}{x[1]}')
# replace -1 values with NaN
df_experience3 = df_experience3.replace(-1, np.nan)
# concatenate all the columns
columns_names = df_experience3.columns
df_experience3['experience'] = df_experience3[columns_names].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
df_experience4 = df_experience3[['experience']]
df_experience4
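To make the reshape above easier to follow, here is a minimal sketch of what the cumcount/unstack step does on toy data (the values are made up):
# toy illustration of the one-row-per-user reshape
toy = pd.DataFrame({'curriculum_id': [1, 1, 2],
                    'puesto': ['analista', 'gerente', 'desarrollador']})
toy['g'] = toy.groupby('curriculum_id').cumcount().add(1)   # 1st, 2nd, ... entry per user
wide = toy.set_index(['curriculum_id', 'g']).unstack().sort_index(level=1, axis=1)
wide.columns = wide.columns.map(lambda x: f'{x[0]}{x[1]}')  # flatten to puesto1, puesto2, ...
wide  # one row per curriculum_id; missing slots become NaN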
### we do the same with the studies information
df_studies2 = df_studies[['curriculum_id', 'titulo']].copy()
df_studies2
### we group the table so we have one row per user
df_studies2['g'] = df_studies2.groupby('curriculum_id').cumcount().add(1)
df_studies2 = (df_studies2.set_index(['curriculum_id','g'])
.unstack(fill_value=-1)
.sort_index(level=1, axis=1))
df_studies2.columns = df_studies2.columns.map(lambda x: f'{x[0]}{x[1]}')
# replace -1 values with NaN
df_studies2 = df_studies2.replace(-1, np.nan)
# concatenate all the columns
columns_names = df_studies2.columns
df_studies2['studies'] = df_studies2[columns_names].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
df_studies3 = df_studies2[['studies']]
df_studies3
df_experience5 = pd.merge(df_experience4, df_studies3, left_index=True, right_index=True)
df_experience5
# concatenate columns
columns_names = df_experience5.columns
df_experience5['experience_final'] = df_experience5[columns_names].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
df_experience6 = df_experience5[['experience_final']]
df_experience6
df_experience6.reset_index()
df_experience7 = df_experience6.copy()
## tokenization
nltk.download('punkt');
def tokenize(column):
"""Tokenizes a Pandas dataframe column and returns a list of tokens.
Args:
column: Pandas dataframe column (i.e. df['text']).
Returns:
tokens (list): Tokenized list, i.e. [Donald, Trump, tweets]
"""
tokens = nltk.word_tokenize(column)
return [w for w in tokens if w.isalpha()]
df_experience7['tokenized'] = df_experience7.apply(lambda x: tokenize(x['experience_final']), axis=1)
df_experience7[['experience_final', 'tokenized']].head()
### Stopword removal
nltk.download('stopwords');
# we use the spanish stopword package and add "nan" as a stopword to be removed
stop_words_sp = set(stopwords.words('spanish'))
stop_words_sp = stop_words_sp.union(['nan'])
def remove_stopwords(tokenized_column):
    """Return a list of tokens with Spanish stopwords (and 'nan') removed.
    Args:
        tokenized_column: Pandas dataframe column of tokenized data from tokenize()
    Returns:
        tokens (list): Tokenized list with stopwords removed.
    """
    stops = set(stop_words_sp)
    return [word for word in tokenized_column if word not in stops]
df_experience7['stopwords_removed'] = df_experience7.apply(lambda x: remove_stopwords(x['tokenized']), axis=1)
df_experience7[['experience_final', 'stopwords_removed']].head()
### stemming
# again we use the spanish package
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('spanish')
def apply_stemming(tokenized_column):
    """Return a list of tokens with Spanish Snowball stemming applied.
    Args:
        tokenized_column: Pandas dataframe column of tokenized data with stopwords removed.
    Returns:
        tokens (list): Tokenized list with words stemmed.
    """
    return [stemmer.stem(word) for word in tokenized_column]
df_experience7['porter_stemmed'] = df_experience7.apply(lambda x: apply_stemming(x['stopwords_removed']), axis=1)
df_experience7[['experience_final', 'porter_stemmed']].head()
## rejoin words for future vectorization
def rejoin_words(tokenized_column):
"""Rejoins a tokenized word list into a single string.
Args:
tokenized_column (list): Tokenized column of words.
Returns:
string: Single string of untokenized words.
"""
return ( " ".join(tokenized_column))
df_experience7['rejoined'] = df_experience7.apply(lambda x: rejoin_words(x['porter_stemmed']), axis=1)
df_experience7[['experience_final', 'rejoined']].head()
df_experience8 = df_experience7[['rejoined']].reset_index().copy()
df_experience8
### add a column with the original description for easier comprehension of the profiles at the end
df_experience_compare = df_experience7.copy()
df_experience_compare.head()
# remove stop words and 'nan' from the description, reusing the remove_stopwords function
# defined above (the 'nan' token is already included in stop_words_sp)
df_experience_compare['stopwords_removed'] = df_experience_compare.apply(lambda x: remove_stopwords(x['tokenized']), axis=1)
df_experience_compare['rejoined'] = df_experience_compare.apply(lambda x: rejoin_words(x['stopwords_removed']), axis=1)
df_experience_compare2 = df_experience_compare.reset_index()
df_experience_compare2 = df_experience_compare2[['curriculum_id', 'rejoined']]
df_experience_compare2.head()
df_experience_compare2 = df_experience_compare2.rename(columns={'rejoined': 'descripcion_perfil_original'})
df_experience8 = pd.merge(left=df_experience8, right=df_experience_compare2[['descripcion_perfil_original','curriculum_id']], how='left', left_on='curriculum_id', right_on='curriculum_id')
df_experience8
We will also use the "profile search dataset", which contains job post descriptions. We will use these later to test our recommender system.
# profiles search dataset
df_avi = pd.read_csv('qtyadatos_aviso.csv')
df_avi.head()
df_avi2 = df_avi[['id', 'empleo_ofrecido', 'descripcion']].copy()
# concatenate two columns
columns_names = ['empleo_ofrecido', 'descripcion']
df_avi2['descripcion2'] = df_avi2[columns_names].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
df_avi4 = df_avi2[['id','descripcion2']].copy()  # copy to avoid chained-assignment warnings when adding columns below
df_avi4
### preprocessing descripcion2 column
#tokenization
df_avi4['tokenized'] = df_avi4.apply(lambda x: tokenize(x['descripcion2']), axis=1)
# remove stopwords
df_avi4['stopwords_removed'] = df_avi4.apply(lambda x: remove_stopwords(x['tokenized']), axis=1)
# stemming
df_avi4['porter_stemmed'] = df_avi4.apply(lambda x: apply_stemming(x['stopwords_removed']), axis=1)
# rejoin
df_avi4['rejoined'] = df_avi4.apply(lambda x: rejoin_words(x['porter_stemmed']), axis=1)
df_avi4.head()
df_avi5 = df_avi4[['id', 'rejoined', 'descripcion2']]
df_avi5 = df_avi5.rename(columns={'rejoined': 'descripcion', 'descripcion2': 'descripcion_original'})
df_avi5.head()
We will start with a general job search, something related to accounting and administrative duties.
# work with job search id 2696
search_text = df_avi5[17:18]
search_text
## see complete description
search_text.iloc[0]['descripcion_original']
## Extract features from text with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_experience = tfidf_vectorizer.fit_transform((df_experience7['rejoined']))
tfidf_experience
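Before computing similarities, we can quickly sanity-check the fitted vectorizer (optional; get_feature_names_out assumes a recent scikit-learn version):
# optional inspection of the TF-IDF matrix and vocabulary
print(tfidf_experience.shape)                         # (number of resumes, vocabulary size)
print(tfidf_vectorizer.get_feature_names_out()[:10])  # a few of the stemmed vocabulary terms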
## To calculate the cosine similarity in Python we will use cosine_similarity from the sklearn package
from sklearn.metrics.pairwise import cosine_similarity
user_tfidf = tfidf_vectorizer.transform(search_text['descripcion'])
cos_similarity_tfidf = map(lambda x: cosine_similarity(user_tfidf, x), tfidf_experience)
output2 = list(cos_similarity_tfidf)
# Function to get the top-N recommendations ordered by score
def get_recommendation(top, df_all, scores):
    recommendation = pd.DataFrame(columns=['ProfileID', 'title', 'score'])
    count = 0
    for i in top:
        recommendation.at[count, 'ProfileID'] = df_all['curriculum_id'][i]
        recommendation.at[count, 'title'] = df_all['rejoined'][i]
        recommendation.at[count, 'score'] = scores[count]
        count += 1
    return recommendation
# The top recommendations using TF-IDF
top = sorted(range(len(output2)), key=lambda i: output2[i], reverse=True)[:10]
list_scores = [output2[i][0][0] for i in top]
get_recommendation(top,df_experience8, list_scores)
# if we want to check the original description, we merge with the original description column
result = get_recommendation(top,df_experience8, list_scores)
table_result = pd.merge(left=result, right=df_experience8[['descripcion_perfil_original','curriculum_id']], how='left', left_on='ProfileID', right_on='curriculum_id')
table_result
## CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
count_experience = count_vectorizer.fit_transform((df_experience7['rejoined']))
count_experience
from sklearn.metrics.pairwise import cosine_similarity
user_count = count_vectorizer.transform(search_text['descripcion'])
cos_similarity_countv = map(lambda x: cosine_similarity(user_count, x),count_experience)
output2 = list(cos_similarity_countv)
top = sorted(range(len(output2)), key=lambda i: output2[i], reverse=True)[:10]
list_scores = [output2[i][0][0] for i in top]
result = get_recommendation(top, df_experience8, list_scores)
table_result = pd.merge(left=result, right=df_experience8[['descripcion_perfil_original','curriculum_id']], how='left', left_on='ProfileID', right_on='curriculum_id')
table_result
## KNN
from sklearn.neighbors import NearestNeighbors
n_neighbors = 11
KNN = NearestNeighbors(n_neighbors=n_neighbors, p=2)
KNN.fit(tfidf_experience)
NNs = KNN.kneighbors(user_tfidf, return_distance=True)
top = NNs[1][0][1:]
index_score = NNs[0][0][1:]
# note: with KNN, the 'score' column holds Euclidean distances, so lower values mean closer profiles
result = get_recommendation(top, df_experience8, index_score)
table_result = pd.merge(left=result, right=df_experience8[['descripcion_perfil_original','curriculum_id']], how='left', left_on='ProfileID', right_on='curriculum_id')
table_result
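Unlike the cosine-based methods, kneighbors returns Euclidean distances, so in this table a lower score is better. If we want a score that reads like the others (higher is better), one optional transformation is:
# optional: turn distances into a similarity-like score in (0, 1]; the ranking is unchanged
similarity_like = 1 / (1 + index_score)
similarity_like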
We see that the top recommendations from each method do not vary much. This could be because our job search is too general. Next, we will try a longer and more specific job post description.
df_avi5[1467:1468].iloc[0]['descripcion_original']
search_text = df_avi5[1467:1468]
search_text
### TF-IDF
from sklearn.metrics.pairwise import cosine_similarity
user_tfidf = tfidf_vectorizer.transform(search_text['descripcion'])
cos_similarity_tfidf = map(lambda x: cosine_similarity(user_tfidf, x), tfidf_experience)
output2 = list(cos_similarity_tfidf)
# The top recommendations using TF-IDF
top = sorted(range(len(output2)), key=lambda i: output2[i], reverse=True)[:10]
list_scores = [output2[i][0][0] for i in top]
result = get_recommendation(top,df_experience8, list_scores).copy()
table_result = pd.merge(left=result, right=df_experience8[['descripcion_perfil_original','curriculum_id']], how='left', left_on='ProfileID', right_on='curriculum_id')
table_result
### CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
count_experience = count_vectorizer.fit_transform((df_experience7['rejoined']))
count_experience
from sklearn.metrics.pairwise import cosine_similarity
user_count = count_vectorizer.transform(search_text['descripcion'])
cos_similarity_countv = map(lambda x: cosine_similarity(user_count, x),count_experience)
output2 = list(cos_similarity_countv)
top = sorted(range(len(output2)), key=lambda i: output2[i], reverse=True)[:10]
list_scores = [output2[i][0][0] for i in top]
result = get_recommendation(top, df_experience8, list_scores)
table_result = pd.merge(left=result, right=df_experience8[['descripcion_perfil_original','curriculum_id']], how='left', left_on='ProfileID', right_on='curriculum_id')
table_result
### KNN
from sklearn.neighbors import NearestNeighbors
n_neighbors = 11
KNN = NearestNeighbors(n_neighbors=n_neighbors, p=2)
KNN.fit(tfidf_experience)
NNs = KNN.kneighbors(user_tfidf, return_distance=True)
top = NNs[1][0][1:]
index_score = NNs[0][0][1:]
result = get_recommendation(top, df_experience8, index_score)
table_result = pd.merge(left=result, right=df_experience8[['descripcion_perfil_original','curriculum_id']], how='left', left_on='ProfileID', right_on='curriculum_id')
table_result
Now we see more varied results across the different methods. As we do not have a ground truth to automatically contrast our results against, we evaluate them by manual inspection.
In particular, we see that the Count Vectorizer method is the most effective recommender: it prioritizes raw word frequency, which in this case turns out to be an effective strategy. The TF-IDF vectorizer, on the other hand, re-weights token frequencies and seems to rank higher some profiles that slightly deviate from what we are looking for. Finally, KNN, although not the most effective, is the least computationally expensive and time-consuming option.
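To illustrate the re-weighting mentioned above, here is a minimal sketch on a toy corpus (not part of the pipeline): raw counts treat every term equally, while TF-IDF downweights a term like 'ventas' that appears in every document.
# toy comparison of raw counts vs TF-IDF weights
docs = ['ventas contabilidad', 'ventas marketing', 'ventas logistica']
print(CountVectorizer().fit_transform(docs).toarray())  # every non-zero entry is 1
print(TfidfVectorizer().fit_transform(docs).toarray())  # 'ventas' gets a lower weight than the rarer terms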