Calculate cosine similarity given 2 sentence strings

Question

From Python  tf-idf-cosine  to find document similarity   it is possible to calculate document similarity using tf-idf cosine  Without importing external libraries  are that any ways to calculate cosine similarity between 2 strings   s1    This is a foo bar sentence    s2    This sentence is similar to a foo bar sentence    s3    What is this string   Totally not related to the other two lines     cosine sim s1  s2    Should give high cosine similarity cosine sim s1  s3    Shouldn t give high cosine similarity value cosine sim s2  s3    Shouldn t give high cosine similarity value

User · Answer

Try this  Download the file  numberbatch-en-17 06 txt  from https   conceptnet s3 amazonaws com downloads 2017 numberbatch numberbatch-en-17 06 txt gz and extract it  The function  get sentence vector  uses a simple sum of word vectors  However it can be improved by using weighted sum where weights are proportional to Tf-Idf of each word    import math import numpy as np  std embeddings index      with open  path to numberbatch-en-17 06 txt   as f      for line in f          values   line split              word   values 0          embedding   np asarray values 1    dtype  float32           std embeddings index word    embedding  def cosineValue v1 v2        compute cosine similarity of v1 to v2   v1 dot v2     v1     v2         sumxx  sumxy  sumyy   0  0  0     for i in range len v1            x   v1 i   y   v2 i          sumxx    x x         sumyy    y y         sumxy    x y     return sumxy math sqrt sumxx sumyy    def get sentence vector sentence  std embeddings index   std embeddings index        sent vector   0     for word in sentence lower   split            if word not in std embeddings index               word vector   np array np random uniform -1 0  1 0  300               std embeddings index word    word vector         else              word vector   std embeddings index word          sent vector   sent vector   word vector      return sent vector  def cosine sim sent1  sent2       return cosineValue get sentence vector sent1   get sentence vector sent2     I did run for the given sentences and found the following results  s1    This is a foo bar sentence    s2    This sentence is similar to a foo bar sentence    s3    What is this string   Totally not related to the other two lines     print cosine sim s1  s2    Should give high cosine similarity print cosine sim s1  s3    Shouldn t give high cosine similarity value print cosine sim s2  s3    Shouldn t give high cosine similarity value  0 9851735249068168 0 6570885718962608 0 6589335425458225

User · Answer

The short answer is  no  it is not possible to do that in a principled way that works even remotely well   It is an unsolved problem in natural language processing research and also happens to be the subject of my doctoral work  I ll very briefly summarize where we are and point you to a few publications   Meaning of words  The most important assumption here is that it is possible to obtain a vector that represents each word in the sentence in quesion  This vector is usually chosen to capture the contexts the word can appear in  For example  if we only consider the three contexts  eat    red  and  fluffy   the word  cat  might be represented as  98  1  87   because if you were to read a very very long piece of text  a few billion words is not uncommon by today s standard   the word  cat  would appear very often in the context of  fluffy  and  eat   but not that often in the context of  red   In the same way   dog  might be represented as  87 2 34  and  umbrella  might be  1 13 0   Imagening these vectors as points in 3D space   cat  is clearly closer to  dog  than it is to  umbrella   therefore  cat  also means something more similar to  dog  than to an  umbrella    This line of work has been investigated since the early 90s  e g  this work by Greffenstette  and has yielded some surprisingly good results  For example  here is a few random entries in a thesaurus I built recently by having my computer read wikipedia   theory - gt  analysis  concept  approach  idea  method voice - gt  vocal  tone  sound  melody  singing james - gt  william  john  thomas  robert  george  charles   These lists of similar words were obtained entirely without human intervention- you feed text in and come back a few hours later   The problem with phrases  You might ask why we are not doing the same thing for longer phrases  such as  ginger foxes love fruit   It s because we do not have enough text  In order for us to reliably establish what X is similar to  we need to see many examples of X being used in context  When X is a single word like  voice   this is not too hard  However  as X gets longer  the chances of finding natural occurrences of X get exponentially slower  For comparison  Google has about 1B pages containing the word  fox  and not a single page containing  ginger foxes love fruit   despite the fact that it is a perfectly valid English sentence and we all understand what it means   Composition  To tackle the problem of data sparsity  we want to perform composition  i e  to take vectors for words  which are easy to obtain from real text  and to put the together in a way that captures their meaning  The bad news is nobody has been able to do that well so far   The simplest and most obvious way is to add or multiply the individual word vectors together  This leads to undesirable side effect that  cats chase dogs  and  dogs chase cats  would mean the same to your system  Also  if you are multiplying  you have to be extra careful or every sentences will end up represented by  0 0 0     0   which defeats the point   Further reading  I will not discuss the more sophisticated methods for composition that have been proposed so far  I suggest you read Katrin Erk s  Vector space models of word meaning and phrase meaning  a survey   This is a very good high-level survey to get you started  Unfortunately  is not freely available on the publisher s website  email the author directly to get a copy  In that paper you will find references to many more concrete methods  The more comprehensible ones are by Mitchel and Lapata  2008  and Baroni and Zamparelli  2010      Edit after comment by  vpekar  The bottom line of this answer is to stress the fact that while naive methods do exist  e g  addition  multiplication  surface similarity  etc   these are fundamentally flawed and in general one should not expect great performance from them

User · Answer

Thanks  vpekar for your implementation  It helped a lot  I just found that it misses the tf-idf weight while calculating the cosine similarity  The Counter word  returns a dictionary which has the list of words along with their occurence   cos q  d    sim q  d     q    d    q  d      sum qi  di   sqrt sum qi2     sqrt sum vi2    where i   1 to v    qi is the tf-idf weight of term i in the query  di is the tf-idf weight of term i in the document   q  and  d  are the lengths of q and d   This is the cosine similarity of q and d             or  equivalently  the cosine of the angle between q and d    Please feel free to view my code here  But first you will have to download the anaconda package  It will automatically set you python path in Windows  Add this python interpreter in Eclipse

User · Answer

Well  if you are aware of word embeddings like Glove Word2Vec Numberbatch  your job is half done  If not let me explain how this can be tackled   Convert each sentence into word tokens  and represent each of these tokens as vectors of high dimension  using the pre-trained word embeddings  or you could train them yourself even    So  now you just don t capture their surface similarity but rather extract the meaning of each word which comprise the sentence as a whole  After this calculate their cosine similarity and you are set

User · Answer

A simple pure-Python implementation would be   import math import re from collections import Counter  WORD   re compile r  w      def get cosine vec1  vec2       intersection   set vec1 keys     amp  set vec2 keys        numerator   sum  vec1 x    vec2 x  for x in intersection        sum1   sum  vec1 x     2 for x in list vec1 keys          sum2   sum  vec2 x     2 for x in list vec2 keys          denominator   math sqrt sum1    math sqrt sum2       if not denominator          return 0 0     else          return float numerator    denominator   def text to vector text       words   WORD findall text      return Counter words    text1    This is a foo bar sentence    text2    This sentence is similar to a foo bar sentence     vector1   text to vector text1  vector2   text to vector text2   cosine   get cosine vector1  vector2   print  Cosine    cosine    Prints   Cosine  0 861640436855   The cosine formula used here is described here   This does not include weighting of the words by tf-idf  but in order to use tf-idf  you need to have a reasonably large corpus from which to estimate tfidf weights   You can also develop it further  by using a more sophisticated way to extract words from a piece of text  stem or lemmatise it  etc

User · Answer

I have similar solution but might be useful for pandas   import math import re from collections import Counter import pandas as pd  WORD   re compile r  w      def get cosine vec1  vec2       intersection   set vec1 keys     amp  set vec2 keys        numerator   sum  vec1 x    vec2 x  for x in intersection        sum1   sum  vec1 x     2 for x in list vec1 keys          sum2   sum  vec2 x     2 for x in list vec2 keys          denominator   math sqrt sum1    math sqrt sum2       if not denominator          return 0 0     else          return float numerator    denominator   def text to vector text       words   WORD findall text      return Counter words   df pd read csv   content drive article csv   df  vector1   df  headline   apply lambda x  text to vector x    df  vector2   df  snippet   apply lambda x  text to vector x    df  simscore   df apply lambda x  get cosine x  vector1   x  vector2    axis 1

[python] Calculate cosine similarity given 2 sentence strings

Examples related to python

Examples related to string

Examples related to nlp

Examples related to similarity

Examples related to cosine-similarity