How to compute the similarity between two text documents

Question

I am looking at working on an NLP project, in any programming language (though Python will be my preference). I want to take two documents and determine how similar they are.

User · Accepted Answer

The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See esp. Introduction to Information Retrieval, which is free and available online.
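To make the idea concrete, here is a minimal pure-Python sketch of the TF-IDF plus cosine pipeline. It is an illustration only, not how any particular library implements it: the whitespace tokenizer, the smoothed IDF formula, and the helper names `tfidf_vectors` and `cosine` are all assumptions chosen for brevity.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (a simplified TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch): rarer terms weigh more.
        vectors.append({t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = ["an apple a day keeps the doctor away",
        "never compare an apple to an orange",
        "the weather is cold today"]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # share "an" and "apple"
sim_02 = cosine(vecs[0], vecs[2])  # share only "the"
sim_12 = cosine(vecs[1], vecs[2])  # share no terms
print(sim_01, sim_02, sim_12)
```

Documents that share more (and rarer) terms score higher, and documents with no terms in common score exactly zero.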

Computing Pairwise Similarities

TF-IDF (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as

from sklearn.feature_extraction.text import TfidfVectorizer

# read each file's contents; the vectorizer expects strings, not file objects
documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

or, if the documents are plain strings,

>>> corpus = ["I'd like an apple", 
...           "An apple a day keeps the doctor away", 
...           "Never compare an apple to an orange", 
...           "I prefer scikit-learn to Orange", 
...           "The scikit-learn docs are Orange and Blue"]                                                                                                                                                                                                   
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")                                                                                                                                                                                                   
>>> tfidf = vect.fit_transform(corpus)                                                                                                                                                                                                                       
>>> pairwise_similarity = tfidf * tfidf.T

though Gensim may have more options for this kind of task.
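The "no need to normalize" comment in the first snippet relies on the rows of the TF-IDF matrix having unit length, in which case a plain dot product already equals the cosine similarity. A small pure-Python check of that identity, using two made-up vectors (the helper names are hypothetical):

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(u, v):
    """Cosine similarity computed the long way, dividing by both norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [3.0, 0.0, 4.0]
v = [1.0, 2.0, 2.0]

plain_cosine = cosine(u, v)
dot_of_unit_vectors = dot(l2_normalize(u), l2_normalize(v))
print(plain_cosine, dot_of_unit_vectors)
```

Both values agree, which is why multiplying a row-normalized matrix by its transpose yields all pairwise cosine similarities at once.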

Interpreting the Results

From above, pairwise_similarity is a Scipy sparse matrix that is square in shape, with the number of rows and columns equal to the number of documents in the corpus.

>>> pairwise_similarity                                                                                                                                                                                                                                      
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

You can convert the sparse array to a NumPy array via .toarray() (the .A shorthand also works in older SciPy releases):

>>> pairwise_similarity.toarray()                                                                                                                                                                                                                            
array([[1.        , 0.17668795, 0.27056873, 0.        , 0.        ],
       [0.17668795, 1.        , 0.15439436, 0.        , 0.        ],
       [0.27056873, 0.15439436, 1.        , 0.19635649, 0.16815247],
       [0.        , 0.        , 0.19635649, 1.        , 0.54499756],
       [0.        , 0.        , 0.16815247, 0.54499756, 1.        ]])

Let's say we want to find the document most similar to the final document, "The scikit-learn docs are Orange and Blue". This document has index 4 in corpus. You can find the index of the most similar document by taking the argmax of that row, but first you'll need to mask the 1's, which represent the similarity of each document to itself. You can do the latter through np.fill_diagonal(), and the former through np.nanargmax():

>>> import numpy as np     

>>> arr = pairwise_similarity.toarray()     
>>> np.fill_diagonal(arr, np.nan)                                                                                                                                                                                                                            

>>> input_doc = "The scikit-learn docs are Orange and Blue"                                                                                                                                                                                                  
>>> input_idx = corpus.index(input_doc)                                                                                                                                                                                                                      
>>> input_idx                                                                                                                                                                                                                                                
4

>>> result_idx = np.nanargmax(arr[input_idx])                                                                                                                                                                                                                
>>> corpus[result_idx]                                                                                                                                                                                                                                       
'I prefer scikit-learn to Orange'
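If you want a full ranking rather than only the single best match, the same idea can be sketched without NumPy. The matrix below is copied from the output above; the helper `rank_similar` is a hypothetical name, not part of any library.

```python
# Pairwise similarities copied from the example output above.
similarity = [
    [1.0, 0.17668795, 0.27056873, 0.0, 0.0],
    [0.17668795, 1.0, 0.15439436, 0.0, 0.0],
    [0.27056873, 0.15439436, 1.0, 0.19635649, 0.16815247],
    [0.0, 0.0, 0.19635649, 1.0, 0.54499756],
    [0.0, 0.0, 0.16815247, 0.54499756, 1.0],
]

def rank_similar(matrix, idx):
    """Return (other_index, score) pairs for row idx, best first, skipping idx itself."""
    scores = [(j, s) for j, s in enumerate(matrix[idx]) if j != idx]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank_similar(similarity, 4)
best_idx, best_score = ranking[0]
print(ranking)
```

Skipping the document's own index plays the same role as masking the diagonal with NaN.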

Note: the purpose of using a sparse matrix is to save (a substantial amount of) space for a large corpus and vocabulary. Instead of converting to a NumPy array, you could do:

>>> n, _ = pairwise_similarity.shape                                                                                                                                                                                                                         
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()                                                                                                                                                                                                                  
3

User · Answer

To find sentence similarity with a very small dataset and still get high accuracy, you can use the Python package below, which uses pre-trained BERT models:

pip install similar-sentences

User · Answer

You might want to try this online service for cosine document similarity: http://www.scurtu.it/documentSimilarity.html

# Python 3
import json
import urllib.parse
import urllib.request

API_URL = "http://www.scurtu.it/apis/documentSimilarity"
inputDict = {}
inputDict['doc1'] = 'Document with some text'
inputDict['doc2'] = 'Other document with some text'
# the POST body must be bytes
params = urllib.parse.urlencode(inputDict).encode('utf-8')
f = urllib.request.urlopen(API_URL, params)
response = f.read()
responseObject = json.loads(response)
print(responseObject)

User · Answer

I am combining the solutions from the answers of @FredFoo and @Renaud. My solution applies @Renaud's preprocessing to the text corpus of @FredFoo and then displays the pairwise similarities that are greater than 0.

I ran this code on Windows by installing Python and pip first. pip is installed as part of Python, but you may have to install it explicitly by re-running the installation package, choosing "modify" and then choosing pip. I use the command line to execute my Python code saved in a file "similarity.py". I had to execute the following commands:

>set PYTHONPATH=%PYTHONPATH%;C:\_location_of_python_lib_
>python -m pip install scikit-learn
>python -m pip install nltk
>py similarity.py

The code for similarity.py is as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk, string
import numpy as np

nltk.download('punkt')  # if necessary...

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

vect = TfidfVectorizer(tokenizer=normalize, stop_words='english')
tfidf = vect.fit_transform(corpus)

pairwise_similarity = tfidf * tfidf.T

# view the pairwise similarities
print(pairwise_similarity)

# check how a string is normalized
print(normalize("The scikit-learn docs are Orange and Blue"))

User · Answer

Identical to @larsman, but with some preprocessing:

import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')  # if necessary...

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

# remove punctuation, lowercase, stem
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    # entry [0, 1] is the similarity between the first and second text
    return (tfidf * tfidf.T).toarray()[0, 1]

print(cosine_sim('a little bird', 'a little bird'))
print(cosine_sim('a little bird', 'a little bird chirps'))
print(cosine_sim('a little bird', 'a big dog barks'))

User · Answer

If you are more interested in measuring the semantic similarity of two pieces of text, I suggest taking a look at this gitlab project. You can run it as a server, and there is also a pre-built model which you can use easily to measure the similarity of two pieces of text. Even though it is mostly trained for measuring the similarity of two sentences, you can still use it in your case. It is written in Java, but you can run it as a RESTful service.

Another option is DKPro Similarity, which is a library with various algorithms to measure the similarity of texts. However, it is also written in Java.

Code example:

// this similarity measure is defined in the dkpro.similarity.algorithms.lexical-asl package
// you need to add that to your .pom to make that example work
// there are some examples that should work out of the box in dkpro.similarity.example-gpl
TextSimilarityMeasure measure = new WordNGramJaccardMeasure(3);    // Use word trigrams

String[] tokens1 = "This is a short example text .".split(" ");
String[] tokens2 = "A short example text could look like that .".split(" ");

double score = measure.getSimilarity(tokens1, tokens2);

System.out.println("Similarity: " + score);

User · Answer

It's an old question, but I found this can be done easily with spaCy. Once the document is read, a simple API, similarity, can be used to find the cosine similarity between the document vectors.

import spacy

# 'en' is the spaCy 2.x shortcut; spaCy 3 needs a full model name such as 'en_core_web_md'
nlp = spacy.load('en')
doc1 = nlp(u'Hello hi there!')
doc2 = nlp(u'Hello hi there!')
doc3 = nlp(u'Hey whatsup?')

print(doc1.similarity(doc2))  # 0.999999954642
print(doc2.similarity(doc3))  # 0.699032527716
print(doc1.similarity(doc3))  # 0.699032527716

User · Answer

Generally, the cosine similarity between two documents is used as a similarity measure. In Java, you can use Lucene (if your collection is pretty large) or LingPipe to do this. The basic concept is to count the terms in every document and calculate the dot product of the term vectors. The libraries provide several improvements over this general approach, e.g. using inverse document frequencies and calculating tf-idf vectors. If you are looking to do something complex, LingPipe also provides methods to calculate LSA similarity between documents, which gives better results than cosine similarity. For Python, you can use NLTK.
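The basic concept (count terms, take the dot product of the count vectors, divide by their lengths) fits in a few lines of standard-library Python. This is only a sketch with raw counts and whitespace tokenization; `term_counts` and `cosine_from_counts` are hypothetical names, and no IDF weighting or stemming is applied.

```python
import math
from collections import Counter

def term_counts(text):
    """Bag-of-words term counts using a naive lowercase/whitespace tokenizer."""
    return Counter(text.lower().split())

def cosine_from_counts(a, b):
    """Cosine similarity of two Counter objects treated as term vectors."""
    # Counter returns 0 for terms missing from b, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

bird = term_counts("a little bird")
chirps = term_counts("a little bird chirps")
dog = term_counts("a big dog barks")

sim_close = cosine_from_counts(bird, chirps)
sim_far = cosine_from_counts(bird, dog)
print(sim_close, sim_far)
```

With raw counts, the shared stop word "a" still contributes to the second score, which is one reason the libraries add IDF weighting or stop-word removal.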

User · Answer

For syntactic similarity, there are 3 easy ways of detecting similarity:

1. Word2Vec
2. GloVe
3. Tfidf or CountVectorizer

For semantic similarity, one can use BERT embeddings and try different word pooling strategies to get a document embedding, and then apply cosine similarity to the document embeddings.

A more advanced methodology can use BERTScore to get the similarity.

Research paper link: https://arxiv.org/abs/1904.09675
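As a rough illustration of one pooling strategy, the sketch below averages per-word vectors into a document vector (mean pooling) and compares documents by cosine similarity. The 3-dimensional vectors are invented toy values standing in for real Word2Vec, GloVe, or BERT embeddings, and the names `TOY_VECTORS` and `mean_pool` are hypothetical.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
TOY_VECTORS = {
    "phone": [0.9, 0.1, 0.0],
    "cellphone": [0.8, 0.2, 0.0],
    "great": [0.1, 0.9, 0.1],
    "good": [0.2, 0.8, 0.1],
    "snow": [0.0, 0.1, 0.9],
    "tomorrow": [0.1, 0.0, 0.8],
}

def mean_pool(tokens, table):
    """Average the vectors of all known tokens; return None if none are known."""
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

doc_a = mean_pool("my phone is good".split(), TOY_VECTORS)
doc_b = mean_pool("your cellphone looks great".split(), TOY_VECTORS)
doc_c = mean_pool("will it snow tomorrow".split(), TOY_VECTORS)

sim_ab = cosine(doc_a, doc_b)
sim_ac = cosine(doc_a, doc_c)
print(sim_ab, sim_ac)
```

The two phone sentences share no words, yet they end up close because their word vectors are close; that is the difference from the purely term-based measures.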

User · Answer

If you are looking for something very accurate, you need to use a better tool than tf-idf. Universal Sentence Encoder is one of the most accurate ones for finding the similarity between any two pieces of text. Google provides pretrained models that you can use for your own application without needing to train anything from scratch. First, you have to install tensorflow and tensorflow-hub:

    pip install tensorflow
    pip install tensorflow-hub

The code below lets you convert any text to a fixed-length vector representation, and then you can use the dot product to find the similarity between them:

# Note: hub.Module, tf.placeholder and tf.Session are TensorFlow 1.x APIs.
import os
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/1?tf-hub-format=compressed"

# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)

# sample text
messages = [
    # Smartphones
    "My phone is not good.",
    "Your cellphone looks great.",

    # Weather
    "Will it snow tomorrow?",
    "Recently a lot of hurricanes have hit the US",

    # Food and health
    "An apple a day, keeps the doctors away",
    "Eating strawberries is healthy",
]

similarity_input_placeholder = tf.placeholder(tf.string, shape=(None))
similarity_message_encodings = embed(similarity_input_placeholder)
with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    message_embeddings_ = session.run(similarity_message_encodings, feed_dict={similarity_input_placeholder: messages})

    corr = np.inner(message_embeddings_, message_embeddings_)
    print(corr)
    heatmap(messages, messages, corr)

and the code for plotting (define it before running the code above):

def heatmap(x_labels, y_labels, values):
    fig, ax = plt.subplots()
    im = ax.imshow(values)

    # We want to show all ticks...
    ax.set_xticks(np.arange(len(x_labels)))
    ax.set_yticks(np.arange(len(y_labels)))
    # ... and label them with the respective list entries
    ax.set_xticklabels(x_labels)
    ax.set_yticklabels(y_labels)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", fontsize=10,
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    for i in range(len(y_labels)):
        for j in range(len(x_labels)):
            text = ax.text(j, i, "%.2f" % values[i, j],
                           ha="center", va="center", color="w", fontsize=6)

    fig.tight_layout()
    plt.show()

In the resulting heatmap, as you can see, the highest similarity is between each text and itself, and then between texts that are close in meaning.

IMPORTANT: the first time you run the code it will be slow, because it needs to download the model. If you want to prevent it from downloading the model again and use the local model instead, create a folder for the cache and add it to the environment variable; then, after the first run, use that path:

tf_hub_cache_dir = "universal_encoder_cached/"
os.environ["TFHUB_CACHE_DIR"] = tf_hub_cache_dir

# pointing to the folder inside cache dir, it will be unique on your system
module_url = tf_hub_cache_dir + "/d8fbeb5c580e50f975ef73e80bebba9654228449/"
embed = hub.Module(module_url)

More information: https://tfhub.dev/google/universal-sentence-encoder/2

User · Answer

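Here's a little app to get you started:

import difflib as dl

# read both documents (Python 3: open() replaces the old file() builtin)
a = open('file').read()
b = open('file1').read()

sim = dl.get_close_matches

s = 0
wa = a.split()
wb = b.split()

# count the words of the first document that have a close match in the second
for i in wa:
    if sim(i, wb):
        s += 1

n = float(s) / float(len(wa))
print('%d%% similarity' % int(n * 100))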
