Here is a function that compares your test data against the training data, using a Tf-Idf transformer fitted on the training data. The advantage is that the scores come back in long format (one row per pair of documents), so you can quickly pivot or group by to find the n closest elements, and that the calculations are done matrix-wise.
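import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def create_tokenizer_score(new_series, train_series, tokenizer):
    """
    Return the cosine similarity between the tf-idf vectors of every
    (new document, train document) pair.

    Args:
        new_series (pd.Series): new data (to compare against train data)
        train_series (pd.Series): train data (to fit the tf-idf transformer)
        tokenizer: vectorizer with fit_transform/transform (e.g. TfidfVectorizer)

    Returns:
        pd.DataFrame: one row per pair, with columns ix_new, ix_train, score
    """
    # Fit the vocabulary and idf weights on the training data only
    train_tfidf = tokenizer.fit_transform(train_series)
    # Project the new data into the same vector space
    new_tfidf = tokenizer.transform(new_series)
    # Rows are new documents, columns are training documents
    X = pd.DataFrame(cosine_similarity(new_tfidf, train_tfidf), columns=train_series.index)
    X['ix_new'] = new_series.index
    # Reshape from wide to long: one row per (ix_new, ix_train) pair
    score = pd.melt(
        X,
        id_vars='ix_new',
        var_name='ix_train',
        value_name='score'
    )
    return score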
train_set = pd.Series(["The sky is blue.", "The sun is bright."])
test_set = pd.Series(["The sun in the sky is bright."])
tokenizer = TfidfVectorizer()  # instantiate your own vectorizer here (TfidfVectorizer, CountVectorizer, with stop words, ...)
score = create_tokenizer_score(train_series=train_set, new_series=test_set, tokenizer=tokenizer)
print(score)
   ix_new  ix_train     score
0       0         0  0.617034
1       0         1  0.862012
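To make the "pivot or group by" part concrete, here is a minimal sketch of both operations on the long-format result. It rebuilds the `score` frame by hand from the output above so that it runs on its own; the names `n`, `top_n` and `wide` are just illustrative, not part of the function.

```python
import pandas as pd

# Long-format scores, copied by hand from the output above
score = pd.DataFrame({
    "ix_new": [0, 0],
    "ix_train": [0, 1],
    "score": [0.617034, 0.862012],
})

n = 1  # illustrative: how many closest training documents to keep per new document

# Group by: sort by descending score, then keep the first n rows of each new document
top_n = (
    score.sort_values("score", ascending=False)
         .groupby("ix_new")
         .head(n)
         .reset_index(drop=True)
)

# Pivot: back to a wide matrix, one row per new document, one column per training document
wide = score.pivot(index="ix_new", columns="ix_train", values="score")

print(top_n)
print(wide)
```

With real data you would pass the DataFrame returned by `create_tokenizer_score` instead of the hand-built one and pick `n` to suit your use case.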