We will use a resume database to develop a content-based recommender system with NLP techniques and tools.
In particular, a content-based recommender system tries to predict what a user will like from the items' features. In our case, the items' features are the descriptions in the resumes, and the recommender takes as input a job post description written in natural language. The idea is to recreate a recruiter's search and recommend different profiles based on the information in their resumes.
Disclaimer: this project is an adaptation of part of my thesis, presented for the MSc in Business Analytics at Imperial College London in August 2022.
We will work with a database of more than 84 thousand resumes written in Spanish. The resumes come from a platform with both closed and open-ended fields. We will use the open-ended information written in natural language, such as education and experience, to build our recommender system.
We will apply the standard pre-processing steps, such as stop-word removal, punctuation removal, lowercasing, tokenizing and stemming. Then we will try three different approaches to build the recommender: TF-IDF vectorization, Count vectorization (bag of words), and K-Nearest Neighbors (KNN) over the TF-IDF vectors.
For the similarity measurement, the first two methods use cosine similarity, and the last one Euclidean distance (a small sketch of both measures follows).
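As a quick illustration of the two measures (a minimal sketch on toy vectors, not part of the pipeline; the numbers are made up):
# toy illustration of cosine similarity vs Euclidean distance
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
a = np.array([[1, 2, 0, 1]])   # e.g. term counts of a job post
b = np.array([[2, 4, 0, 2]])   # same direction as a, twice the length
c = np.array([[0, 1, 3, 0]])   # different term profile
print(cosine_similarity(a, b))    # ~1.0: orientation is identical, length is ignored
print(cosine_similarity(a, c))    # ~0.26: different term profile, lower similarity
print(euclidean_distances(a, b))  # ~2.45: magnitude matters for this measure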
Finally, since this is an unsupervised learning problem we do not have a "ground truth", i.e., there is no objective reference against which to compare our outputs, so we will evaluate the different methods by manual inspection.
# libraries
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
# load dataframes
df_studies = pd.read_csv('qtyadatos_cvestudio.csv')
df_experience = pd.read_csv('qtyadatos_cvexperiencia.csv')
df_experience.describe()
## we will work with the experience dataframe. 'curriculum_id' is the column that identifies user id
df_experience3 = df_experience[['curriculum_id','puesto','responsabilidades']].copy()
df_experience3.head()
### we group the table so we have one row per user
df_experience3['g'] = df_experience3.groupby('curriculum_id').cumcount().add(1)
df_experience3 = (df_experience3.set_index(['curriculum_id','g'])
.unstack(fill_value=-1)
.sort_index(level=1, axis=1))
df_experience3.columns = df_experience3.columns.map(lambda x: f'{x[0]}{x[1]}')
# replace -1 values with NaN
df_experience3 = df_experience3.replace(-1, np.nan)
# concatenate all the columns
columns_names = df_experience3.columns
df_experience3['experience'] = df_experience3[columns_names].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
df_experience4 = df_experience3[['experience']]
df_experience4
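To make the reshape above easier to follow, here is a minimal sketch of what the cumcount/unstack step does on toy data (the values are made up):
# toy illustration of the one-row-per-user reshape
toy = pd.DataFrame({'curriculum_id': [1, 1, 2],
                    'puesto': ['analista', 'gerente', 'desarrollador']})
toy['g'] = toy.groupby('curriculum_id').cumcount().add(1)   # 1st, 2nd, ... entry per user
wide = toy.set_index(['curriculum_id', 'g']).unstack().sort_index(level=1, axis=1)
wide.columns = wide.columns.map(lambda x: f'{x[0]}{x[1]}')  # flatten to puesto1, puesto2, ...
wide  # one row per curriculum_id; missing slots become NaN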
### we do the same with the studies information
df_studies2 = df_studies[['curriculum_id', 'titulo']].copy()
df_studies2
### we group the table so we have one row per user
df_studies2['g'] = df_studies2.groupby('curriculum_id').cumcount().add(1)
df_studies2 = (df_studies2.set_index(['curriculum_id','g'])
.unstack(fill_value=-1)
.sort_index(level=1, axis=1))
df_studies2.columns = df_studies2.columns.map(lambda x: f'{x[0]}{x[1]}')
# replace -1 values with NaN
df_studies2 = df_studies2.replace(-1, np.nan)
# concatenate all the columns
columns_names = df_studies2.columns
df_studies2['studies'] = df_studies2[columns_names].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
df_studies3 = df_studies2[['studies']]
df_studies3
df_experience5 = pd.merge(df_experience4, df_studies3, left_index=True, right_index=True)
df_experience5
# concatenate columns
columns_names = df_experience5.columns
df_experience5['experience_final'] = df_experience5[columns_names].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
df_experience6 = df_experience5[['experience_final']]
df_experience6
df_experience6.reset_index()
df_experience7 = df_experience6.copy()
## tokenization
nltk.download('punkt');
def tokenize(column):
"""Tokenizes a Pandas dataframe column and returns a list of tokens.
Args:
column: Pandas dataframe column (i.e. df['text']).
Returns:
tokens (list): Tokenized list, i.e. [Donald, Trump, tweets]
"""
tokens = nltk.word_tokenize(column)
return [w for w in tokens if w.isalpha()]
df_experience7['tokenized'] = df_experience7.apply(lambda x: tokenize(x['experience_final']), axis=1)
df_experience7[['experience_final', 'tokenized']].head()
### Stopword removal
nltk.download('stopwords');
# we use the spanish stopword package and add "nan" as a stopword to be removed
stop_words_sp = set(stopwords.words('spanish'))
stop_words_sp = stop_words_sp.union(['nan'])
def remove_stopwords(tokenized_column):
    """Return a list of tokens with Spanish stopwords (and 'nan') removed.
    Args:
        tokenized_column: Pandas dataframe column of tokenized data from tokenize()
    Returns:
        tokens (list): Tokenized list with stopwords removed.
    """
    stops = set(stop_words_sp)
    return [word for word in tokenized_column if word not in stops]
df_experience7['stopwords_removed'] = df_experience7.apply(lambda x: remove_stopwords(x['tokenized']), axis=1)
df_experience7[['experience_final', 'stopwords_removed']].head()
### stemming
# again we use the spanish package
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('spanish')
def apply_stemming(tokenized_column):
    """Return a list of tokens with Spanish Snowball stemming applied.
    Args:
        tokenized_column: Pandas dataframe column of tokenized data with stopwords removed.
    Returns:
        tokens (list): Tokenized list with words stemmed.
    """
    return [stemmer.stem(word) for word in tokenized_column]
df_experience7['porter_stemmed'] = df_experience7.apply(lambda x: apply_stemming(x['stopwords_removed']), axis=1)
df_experience7[['experience_final', 'porter_stemmed']].head()
## rejoin words for future vectorization
def rejoin_words(tokenized_column):
"""Rejoins a tokenized word list into a single string.
Args:
tokenized_column (list): Tokenized column of words.
Returns:
string: Single string of untokenized words.
"""
return ( " ".join(tokenized_column))
df_experience7['rejoined'] = df_experience7.apply(lambda x: rejoin_words(x['porter_stemmed']), axis=1)
df_experience7[['experience_final', 'rejoined']].head()
df_experience8 = df_experience7[['rejoined']].reset_index().copy()
df_experience8
### add a column with the original description for easier comprehension of the profiles at the end
df_experience_compare = df_experience7.copy()
df_experience_compare.head()
# remove stop words and 'nan' from the description, reusing the remove_stopwords function
# defined above (the 'nan' token is already included in stop_words_sp)
df_experience_compare['stopwords_removed'] = df_experience_compare.apply(lambda x: remove_stopwords(x['tokenized']), axis=1)
df_experience_compare['rejoined'] = df_experience_compare.apply(lambda x: rejoin_words(x['stopwords_removed']), axis=1)
df_experience_compare2 = df_experience_compare.reset_index()
df_experience_compare2 = df_experience_compare2[['curriculum_id', 'rejoined']]
df_experience_compare2.head()
df_experience_compare2 = df_experience_compare2.rename(columns={'rejoined': 'descripcion_perfil_original'})
df_experience8 = pd.merge(left=df_experience8, right=df_experience_compare2[['descripcion_perfil_original','curriculum_id']], how='left', left_on='curriculum_id', right_on='curriculum_id')
df_experience8
We will also use the "profile search dataset", which contains job post descriptions. We will use these later to test our recommender system.
# profiles search dataset
df_avi = pd.read_csv('qtyadatos_aviso.csv')
df_avi.head()
df_avi2 = df_avi[['id', 'empleo_ofrecido', 'descripcion']].copy()
# concatenate two columns
columns_names = ['empleo_ofrecido', 'descripcion']
df_avi2['descripcion2'] = df_avi2[columns_names].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
df_avi4 = df_avi2[['id','descripcion2']].copy()  # copy to avoid chained-assignment warnings when adding columns below
df_avi4
### preprocessing descripcion2 column
#tokenization
df_avi4['tokenized'] = df_avi4.apply(lambda x: tokenize(x['descripcion2']), axis=1)
# remove stopwords
df_avi4['stopwords_removed'] = df_avi4.apply(lambda x: remove_stopwords(x['tokenized']), axis=1)
# stemming
df_avi4['porter_stemmed'] = df_avi4.apply(lambda x: apply_stemming(x['stopwords_removed']), axis=1)
# rejoin
df_avi4['rejoined'] = df_avi4.apply(lambda x: rejoin_words(x['porter_stemmed']), axis=1)
df_avi4.head()
df_avi5 = df_avi4[['id', 'rejoined', 'descripcion2']]
df_avi5 = df_avi5.rename(columns={'rejoined': 'descripcion', 'descripcion2': 'descripcion_original'})
df_avi5.head()
We will start with a general job search, something related to accounting and administrative duties.
# work with job search id 2696
search_text = df_avi5[17:18]
search_text
## see complete description
search_text.iloc[0]['descripcion_original']
## Extract features from text with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_experience = tfidf_vectorizer.fit_transform((df_experience7['rejoined']))
tfidf_experience
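Before computing similarities, we can quickly sanity-check the fitted vectorizer (optional; get_feature_names_out assumes a recent scikit-learn version):
# optional inspection of the TF-IDF matrix and vocabulary
print(tfidf_experience.shape)                         # (number of resumes, vocabulary size)
print(tfidf_vectorizer.get_feature_names_out()[:10])  # a few of the stemmed vocabulary terms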
## To calculate the cosine similarity in Python we will use cosine_similarity from the sklearn package
from sklearn.metrics.pairwise import cosine_similarity
user_tfidf = tfidf_vectorizer.transform(search_text['descripcion'])
cos_similarity_tfidf = map(lambda x: cosine_similarity(user_tfidf, x), tfidf_experience)
output2 = list(cos_similarity_tfidf)
# Function to get the top-N recommendations ordered by score
def get_recommendation(top, df_all, scores):
    recommendation = pd.DataFrame(columns=['ProfileID', 'title', 'score'])
    count = 0
    for i in top:
        recommendation.at[count, 'ProfileID'] = df_all['curriculum_id'][i]
        recommendation.at[count, 'title'] = df_all['rejoined'][i]
        recommendation.at[count, 'score'] = scores[count]
        count += 1
    return recommendation
# The top recommendations using TF-IDF
top = sorted(range(len(output2)), key=lambda i: output2[i], reverse=True)[:10]
list_scores = [output2[i][0][0] for i in top]
get_recommendation(top,df_experience8, list_scores)
# if we want to check the original description, we merge with the original description column
result = get_recommendation(top,df_experience8, list_scores)
table_result = pd.merge(left=result, right=df_experience8[['descripcion_perfil_original','curriculum_id']], how='left', left_on='ProfileID', right_on='curriculum_id')
table_result
## CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
count_experience = count_vectorizer.fit_transform((df_experience7['rejoined']))
count_experience
from sklearn.metrics.pairwise import cosine_similarity
user_count = count_vectorizer.transform(search_text['descripcion'])
cos_similarity_countv = map(lambda x: cosine_similarity(user_count, x),count_experience)
output2 = list(cos_similarity_countv)
top = sorted(range(len(output2)), key=lambda i: output2[i], reverse=True)[:10]
list_scores = [output2[i][0][0] for i in top]
result = get_recommendation(top, df_experience8, list_scores)
table_result = pd.merge(left=result, right=df_experience8[['descripcion_perfil_original','curriculum_id']], how='left', left_on='ProfileID', right_on='curriculum_id')
table_result
## KNN
from sklearn.neighbors import NearestNeighbors
n_neighbors = 11
KNN = NearestNeighbors(n_neighbors=n_neighbors, p=2)
KNN.fit(tfidf_experience)
NNs = KNN.kneighbors(user_tfidf, return_distance=True)
top = NNs[1][0][1:]
index_score = NNs[0][0][1:]
# note: with KNN, the 'score' column holds Euclidean distances, so lower values mean closer profiles
result = get_recommendation(top, df_experience8, index_score)
table_result = pd.merge(left=result, right=df_experience8[['descripcion_perfil_original','curriculum_id']], how='left', left_on='ProfileID', right_on='curriculum_id')
table_result
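Unlike the cosine-based methods, kneighbors returns Euclidean distances, so in this table a lower score is better. If we want a score that reads like the others (higher is better), one optional transformation is:
# optional: turn distances into a similarity-like score in (0, 1]; the ranking is unchanged
similarity_like = 1 / (1 + index_score)
similarity_like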
We see that the top recommendations from each method do not vary much. This could be because our job search is too general. Next, we will try a longer and more specific job post description.
df_avi5[1467:1468].iloc[0]['descripcion_original']
search_text = df_avi5[1467:1468]
search_text
### TF-IDF
from sklearn.metrics.pairwise import cosine_similarity
user_tfidf = tfidf_vectorizer.transform(search_text['descripcion'])
cos_similarity_tfidf = map(lambda x: cosine_similarity(user_tfidf, x), tfidf_experience)
output2 = list(cos_similarity_tfidf)
# The top recommendations using TF-IDF
top = sorted(range(len(output2)), key=lambda i: output2[i], reverse=True)[:10]
list_scores = [output2[i][0][0] for i in top]
result = get_recommendation(top,df_experience8, list_scores).copy()
table_result = pd.merge(left=result, right=df_experience8[['descripcion_perfil_original','curriculum_id']], how='left', left_on='ProfileID', right_on='curriculum_id')
table_result
### CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
count_experience = count_vectorizer.fit_transform((df_experience7['rejoined']))
count_experience
from sklearn.metrics.pairwise import cosine_similarity
user_count = count_vectorizer.transform(search_text['descripcion'])
cos_similarity_countv = map(lambda x: cosine_similarity(user_count, x),count_experience)
output2 = list(cos_similarity_countv)
top = sorted(range(len(output2)), key=lambda i: output2[i], reverse=True)[:10]
list_scores = [output2[i][0][0] for i in top]
result = get_recommendation(top, df_experience8, list_scores)
table_result = pd.merge(left=result, right=df_experience8[['descripcion_perfil_original','curriculum_id']], how='left', left_on='ProfileID', right_on='curriculum_id')
table_result
### KNN
from sklearn.neighbors import NearestNeighbors
n_neighbors = 11
KNN = NearestNeighbors(n_neighbors=n_neighbors, p=2)
KNN.fit(tfidf_experience)
NNs = KNN.kneighbors(user_tfidf, return_distance=True)
top = NNs[1][0][1:]
index_score = NNs[0][0][1:]
result = get_recommendation(top, df_experience8, index_score)
table_result = pd.merge(left=result, right=df_experience8[['descripcion_perfil_original','curriculum_id']], how='left', left_on='ProfileID', right_on='curriculum_id')
table_result
Now we see more varied results across the different methods. As we do not have a ground truth to automatically contrast our results against, we evaluate them by manual inspection.
In particular, we see that the Count Vectorizer method is the most effective recommender: it prioritizes raw word frequency, which in this case turns out to be an effective strategy. The TF-IDF vectorizer, on the other hand, re-weights token frequencies and seems to rank higher some profiles that slightly deviate from what we are looking for. Finally, KNN, although not the most effective, is the least computationally expensive and time-consuming option.
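To illustrate the re-weighting mentioned above, here is a minimal sketch on a toy corpus (not part of the pipeline): raw counts treat every term equally, while TF-IDF downweights a term like 'ventas' that appears in every document.
# toy comparison of raw counts vs TF-IDF weights
docs = ['ventas contabilidad', 'ventas marketing', 'ventas logistica']
print(CountVectorizer().fit_transform(docs).toarray())  # every non-zero entry is 1
print(TfidfVectorizer().fit_transform(docs).toarray())  # 'ventas' gets a lower weight than the rarer terms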