[python] How to calculate the sentence similarity using word2vec model of gensim with python

According to Gensim Word2Vec, I can use the word2vec model in the gensim package to calculate the similarity between two words.

e.g.

trained_model.similarity('woman', 'man') 
0.73723527

However, the word2vec model fails to predict sentence similarity. I found the LSI model with sentence similarity in gensim, but it doesn't seem like it can be combined with the word2vec model. Each sentence in my corpus is fairly short (fewer than 10 words). So, are there any simple ways to achieve the goal?

Tags: python, gensim, word2vec

Answers:


There is a function from the documentation that takes two lists of words and compares their similarity:

s1 = 'This room is dirty'
s2 = 'dirty and disgusting room'

# n_similarity averages the word vectors of each list and returns their cosine similarity
similarity = model.wv.n_similarity(s1.lower().split(), s2.lower().split())

Since you're using gensim, you should probably use its doc2vec implementation. doc2vec is an extension of word2vec to the phrase, sentence, and document level. It's a pretty simple extension, described here:

http://cs.stanford.edu/~quocle/paragraph_vector.pdf

Gensim is nice because it's intuitive, fast, and flexible. What's great is that you can grab the pretrained word embeddings from the official word2vec page, and the syn0 layer of gensim's Doc2Vec model is exposed so that you can seed the word embeddings with these high-quality vectors!

GoogleNews-vectors-negative300.bin.gz (as linked in Google Code)

I think gensim is definitely the easiest (and so far for me, the best) tool for embedding a sentence in a vector space.
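For concreteness, a minimal doc2vec sketch might look like this (the parameter names follow recent gensim releases, and the tiny two-sentence corpus and the hyperparameters are placeholders, not a serious training setup):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

sentences = ['This room is dirty', 'dirty and disgusting room']
documents = [TaggedDocument(words=s.lower().split(), tags=[i])
             for i, s in enumerate(sentences)]

# train a toy model; a real corpus needs far more data and tuning
model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, epochs=40)

# infer vectors for (possibly unseen) sentences and compare them with cosine similarity
v1 = model.infer_vector('This room is dirty'.lower().split())
v2 = model.infer_vector('dirty and disgusting room'.lower().split())
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))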

There exist other sentence-to-vector techniques than the one proposed in Le & Mikolov's paper above. Socher and Manning from Stanford are certainly two of the most famous researchers working in this area. Their work has been based on the principle of compositionality - the semantics of a sentence come from:

1. semantics of the words

2. rules for how these words interact and combine into phrases

They've proposed a few such models (getting increasingly more complex) for how to use compositionality to build sentence-level representations.

2011 - unfolding recursive autoencoder (comparatively simple; start here if interested)

2012 - matrix-vector neural network

2013 - neural tensor network

2015 - Tree LSTM

His papers are all available at socher.org. Some of these models are available, but I'd still recommend gensim's doc2vec. For one, the 2011 URAE isn't particularly powerful. In addition, it comes pretrained with weights suited to paraphrasing news-y data. The code he provides does not allow you to retrain the network. You also can't swap in different word vectors, so you're stuck with the 2011 pre-word2vec embeddings from Turian. These vectors are certainly not on the level of word2vec's or GloVe's.

Haven't worked with the Tree LSTM yet, but it seems very promising!

tl;dr Yeah, use gensim's doc2vec. But other methods do exist!


Once you compute the sum of the two sets of word vectors, you should take the cosine between the vectors, not the difference. The cosine can be computed by taking the dot product of the two normalized vectors, so the word count is not a factor.
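As a quick illustration (v1 and v2 below stand for the two summed sentence vectors; the helper name is mine):

import numpy as np

def cosine(v1, v2):
    # dot product of the two vectors after normalization; the number of words summed cancels out
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# e.g. with summed word vectors (assumes a gensim model with .wv)
# v1 = np.sum([model.wv[w] for w in 'this room is dirty'.split() if w in model.wv], axis=0)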


There are extensions of Word2Vec intended to solve the problem of comparing longer pieces of text like phrases or sentences. One of them is paragraph2vec or doc2vec.

"Distributed Representations of Sentences and Documents" http://cs.stanford.edu/~quocle/paragraph_vector.pdf

http://rare-technologies.com/doc2vec-tutorial/


If you are using word2vec, you need to calculate the average vector for all words in every sentence/document and use cosine similarity between vectors:

import numpy as np
from scipy import spatial

index2word_set = set(model.wv.index2word)  # words known to the model

def avg_feature_vector(sentence, model, num_features, index2word_set):
    words = sentence.split()
    feature_vec = np.zeros((num_features, ), dtype='float32')
    n_words = 0
    for word in words:
        if word in index2word_set:  # skip out-of-vocabulary words
            n_words += 1
            feature_vec = np.add(feature_vec, model.wv[word])
    if n_words > 0:
        feature_vec = np.divide(feature_vec, n_words)  # average of the word vectors
    return feature_vec

Calculate similarity:

s1_afv = avg_feature_vector('this is a sentence', model=model, num_features=300, index2word_set=index2word_set)
s2_afv = avg_feature_vector('this is also sentence', model=model, num_features=300, index2word_set=index2word_set)
sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
print(sim)

> 0.915479828613

Facebook Research released a solution called InferSent. The results and code are published on GitHub; check their repo. It is pretty awesome. I am planning to use it. https://github.com/facebookresearch/InferSent

Their paper: https://arxiv.org/abs/1705.02364
Abstract: Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have however not been so successful. Several attempts at learning unsupervised representations of sentences have not reached satisfactory enough performance to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features, which can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks. Our encoder is publicly available.
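For reference, usage roughly follows the repo's README (the file paths below are placeholders, models.py comes from the InferSent repo itself, and the exact parameter names may differ between repo versions):

import torch
from models import InferSent  # models.py is part of the InferSent repo

V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V   # placeholder path to the downloaded encoder
W2V_PATH = 'fastText/crawl-300d-2M.vec'      # placeholder path to the word vectors
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}

infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))
infersent.set_w2v_path(W2V_PATH)

sentences = ['This room is dirty', 'dirty and disgusting room']
infersent.build_vocab(sentences, tokenize=True)
embeddings = infersent.encode(sentences, tokenize=True)  # one vector per sentence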


You can just add the word vectors of one sentence together, then take the cosine similarity of the two sentence vectors as the similarity of the two sentences. I think that's the easiest way.


I am using the following method and it works well. You first need to run a POS tagger and then filter your sentence to get rid of the stop words (determiners, conjunctions, ...). I recommend TextBlob's APTagger. Then you build a sentence vector by taking the mean of each word vector in the sentence. The n_similarity method in gensim's word2vec does exactly that by allowing you to pass two sets of words to compare.
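A rough sketch of that pipeline could look like the following (the exact set of POS prefixes kept as "content words" is my own assumption, not something from the original answer):

from textblob import TextBlob

def content_words(sentence):
    # keep nouns, verbs, adjectives and adverbs; drop determiners, conjunctions, etc.
    return [word.lower() for word, pos in TextBlob(sentence).tags
            if pos[:2] in ('NN', 'VB', 'JJ', 'RB')]

# n_similarity averages the word vectors of each list and returns their cosine similarity
sim = model.wv.n_similarity(content_words('This room is dirty'),
                            content_words('dirty and disgusting room'))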


I have tried the methods provided by the previous answers. They work, but their main drawback is that the longer the sentences, the larger the similarity will be (to calculate the similarity I use the cosine score of the mean embeddings of any two sentences), since the more words there are, the more positive semantic effects are added to the sentence.

I thought I should change my approach and use sentence embeddings instead, as studied in this paper and this.


You can use the Word Mover's Distance algorithm. Here is an easy description of WMD.

#load word2vec model, here GoogleNews is used
model = gensim.models.KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)
#two sample sentences, tokenized into word lists (wmdistance expects lists of tokens)
s1 = 'the first sentence'.lower().split()
s2 = 'the second text'.lower().split()

#calculate distance between two sentences using WMD algorithm
distance = model.wmdistance(s1, s2)

print('distance = %.3f' % distance)

P.S.: if you face an error about importing the pyemd library, you can install it using the following command:

pip install pyemd

If you are not using Word2Vec, there are other models for this, such as BERT for sentence embedding. Below is a reference link: https://github.com/UKPLab/sentence-transformers

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer
import scipy.spatial

embedder = SentenceTransformer('bert-base-nli-mean-tokens')

# Corpus with example sentences
corpus = ['A man is eating a food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
corpus_embeddings = embedder.encode(corpus)

# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']
query_embeddings = embedder.encode(queries)

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
closest_n = 5
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx, distance in results[0:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1-distance))

Another link to follow: https://github.com/hanxiao/bert-as-service
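With bert-as-service, the client side is roughly as follows (it assumes you have already started a BERT server separately, as described in that repo):

from bert_serving.client import BertClient
import numpy as np

bc = BertClient()  # connects to a locally running bert-serving server
v1, v2 = bc.encode(['A man is eating pasta.', 'A man is eating a piece of bread.'])
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))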


I would like to update the existing solution to help the people who are going to calculate the semantic similarity of sentences.

Step 1:

Load a suitable model using gensim and calculate the word vectors for the words in the sentence, storing them as a word list.

Step 2: Computing the sentence vector

The calculation of semantic similarity between sentences used to be difficult, but recently a paper named "A Simple but Tough-to-Beat Baseline for Sentence Embeddings" proposed a simple approach: compute the weighted average of the word vectors in the sentence and then remove the projections of the average vectors on their first principal component. Here the weight of a word w is a/(a + p(w)), with a being a parameter and p(w) the (estimated) word frequency; this weighting is called smooth inverse frequency. This method performs significantly better.

A simple code to calculate the sentence vector using SIF (smooth inverse frequency), the method proposed in the paper, has been given here; a hedged sketch is also included after Step 3.

Step 3: Using sklearn's cosine_similarity, load the two sentence vectors and compute the similarity.

This is the simplest and most efficient method to compute sentence similarity.
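Here is the hedged SIF sketch mentioned above; it assumes `model` is a trained gensim word2vec model and `word_freq` is a dict mapping each word to its estimated frequency p(w) (both are placeholders), with a = 1e-3 as in the paper:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def sif_embeddings(sentences, model, word_freq, a=1e-3):
    vectors = []
    for sentence in sentences:
        words = [w for w in sentence.lower().split() if w in model.wv]
        # weighted average of word vectors; weight of word w is a / (a + p(w))
        weights = np.array([a / (a + word_freq.get(w, 0.0)) for w in words])
        word_vecs = np.array([model.wv[w] for w in words])
        vectors.append(np.average(word_vecs, axis=0, weights=weights))
    X = np.vstack(vectors)
    # remove the projection of each sentence vector on the first principal component
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    pc = vt[0]
    return X - np.outer(X @ pc, pc)

emb = sif_embeddings(['this is a sentence', 'this is also a sentence'], model, word_freq)
print(cosine_similarity(emb[0:1], emb[1:2]))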


Gensim implements a model called Doc2Vec for paragraph embedding.

There are different tutorials presented as IPython notebooks.

Another method would rely on Word2Vec and Word Mover's Distance (WMD), as shown in this tutorial.

An alternative solution would be to rely on average vectors:

from gensim.models import KeyedVectors
from gensim.utils import simple_preprocess    

def tidy_sentence(sentence, vocabulary):
    return [word for word in simple_preprocess(sentence) if word in vocabulary]    

def compute_sentence_similarity(sentence_1, sentence_2, model_wv):
    vocabulary = set(model_wv.index2word)    
    tokens_1 = tidy_sentence(sentence_1, vocabulary)    
    tokens_2 = tidy_sentence(sentence_2, vocabulary)    
    return model_wv.n_similarity(tokens_1, tokens_2)

wv = KeyedVectors.load('model.wv', mmap='r')
sim = compute_sentence_similarity('this is a sentence', 'this is also a sentence', wv)
print(sim)

Finally, if you can run TensorFlow, you may try: https://tfhub.dev/google/universal-sentence-encoder/2
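A minimal sketch for the module linked above (this assumes TensorFlow 1.x and tensorflow_hub, since the /2 module uses the old hub.Module/session API):

import tensorflow as tf
import tensorflow_hub as hub
import numpy as np

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
sentences = ["this is a sentence", "this is also a sentence"]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(embed(sentences))

# cosine similarity between the two sentence vectors
print(np.dot(vectors[0], vectors[1]) / (np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1])))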