[python] Calculate cosine similarity given 2 sentence strings

From Python: tf-idf-cosine: to find document similarity , it is possible to calculate document similarity using tf-idf cosine. Without importing external libraries, are that any ways to calculate cosine similarity between 2 strings?

s1 = "This is a foo bar sentence ."
s2 = "This sentence is similar to a foo bar sentence ."
s3 = "What is this string ? Totally not related to the other two lines ."

cosine_sim(s1, s2) # Should give high cosine similarity
cosine_sim(s1, s3) # Shouldn't give high cosine similarity value
cosine_sim(s2, s3) # Shouldn't give high cosine similarity value

This question is related to python string nlp similarity cosine-similarity

The answer is


The short answer is "no, it is not possible to do that in a principled way that works even remotely well". It is an unsolved problem in natural language processing research and also happens to be the subject of my doctoral work. I'll very briefly summarize where we are and point you to a few publications:

Meaning of words

The most important assumption here is that it is possible to obtain a vector that represents each word in the sentence in quesion. This vector is usually chosen to capture the contexts the word can appear in. For example, if we only consider the three contexts "eat", "red" and "fluffy", the word "cat" might be represented as [98, 1, 87], because if you were to read a very very long piece of text (a few billion words is not uncommon by today's standard), the word "cat" would appear very often in the context of "fluffy" and "eat", but not that often in the context of "red". In the same way, "dog" might be represented as [87,2,34] and "umbrella" might be [1,13,0]. Imagening these vectors as points in 3D space, "cat" is clearly closer to "dog" than it is to "umbrella", therefore "cat" also means something more similar to "dog" than to an "umbrella".

This line of work has been investigated since the early 90s (e.g. this work by Greffenstette) and has yielded some surprisingly good results. For example, here is a few random entries in a thesaurus I built recently by having my computer read wikipedia:

theory -> analysis, concept, approach, idea, method
voice -> vocal, tone, sound, melody, singing
james -> william, john, thomas, robert, george, charles

These lists of similar words were obtained entirely without human intervention- you feed text in and come back a few hours later.

The problem with phrases

You might ask why we are not doing the same thing for longer phrases, such as "ginger foxes love fruit". It's because we do not have enough text. In order for us to reliably establish what X is similar to, we need to see many examples of X being used in context. When X is a single word like "voice", this is not too hard. However, as X gets longer, the chances of finding natural occurrences of X get exponentially slower. For comparison, Google has about 1B pages containing the word "fox" and not a single page containing "ginger foxes love fruit", despite the fact that it is a perfectly valid English sentence and we all understand what it means.

Composition

To tackle the problem of data sparsity, we want to perform composition, i.e. to take vectors for words, which are easy to obtain from real text, and to put the together in a way that captures their meaning. The bad news is nobody has been able to do that well so far.

The simplest and most obvious way is to add or multiply the individual word vectors together. This leads to undesirable side effect that "cats chase dogs" and "dogs chase cats" would mean the same to your system. Also, if you are multiplying, you have to be extra careful or every sentences will end up represented by [0,0,0,...,0], which defeats the point.

Further reading

I will not discuss the more sophisticated methods for composition that have been proposed so far. I suggest you read Katrin Erk's "Vector space models of word meaning and phrase meaning: a survey". This is a very good high-level survey to get you started. Unfortunately, is not freely available on the publisher's website, email the author directly to get a copy. In that paper you will find references to many more concrete methods. The more comprehensible ones are by Mitchel and Lapata (2008) and Baroni and Zamparelli (2010).


Edit after comment by @vpekar: The bottom line of this answer is to stress the fact that while naive methods do exist (e.g. addition, multiplication, surface similarity, etc), these are fundamentally flawed and in general one should not expect great performance from them.


I have similar solution but might be useful for pandas

import math
import re
from collections import Counter
import pandas as pd

WORD = re.compile(r"\w+")


def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)

df=pd.read_csv('/content/drive/article.csv')
df['vector1']=df['headline'].apply(lambda x: text_to_vector(x)) 
df['vector2']=df['snippet'].apply(lambda x: text_to_vector(x)) 
df['simscore']=df.apply(lambda x: get_cosine(x['vector1'],x['vector2']),axis=1)

Well, if you are aware of word embeddings like Glove/Word2Vec/Numberbatch, your job is half done. If not let me explain how this can be tackled. Convert each sentence into word tokens, and represent each of these tokens as vectors of high dimension (using the pre-trained word embeddings, or you could train them yourself even!). So, now you just don't capture their surface similarity but rather extract the meaning of each word which comprise the sentence as a whole. After this calculate their cosine similarity and you are set.


Thanks @vpekar for your implementation. It helped a lot. I just found that it misses the tf-idf weight while calculating the cosine similarity. The Counter(word) returns a dictionary which has the list of words along with their occurence.

cos(q, d) = sim(q, d) = (q ยท d)/(|q||d|) = (sum(qi, di)/(sqrt(sum(qi2)))*(sqrt(sum(vi2))) where i = 1 to v)

  • qi is the tf-idf weight of term i in the query.
  • di is the tf-idf
  • weight of term i in the document. |q| and |d| are the lengths of q and d.
  • This is the cosine similarity of q and d . . . . . . or, equivalently, the cosine of the angle between q and d.

Please feel free to view my code here. But first you will have to download the anaconda package. It will automatically set you python path in Windows. Add this python interpreter in Eclipse.


Try this. Download the file 'numberbatch-en-17.06.txt' from https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-en-17.06.txt.gz and extract it. The function 'get_sentence_vector' uses a simple sum of word vectors. However it can be improved by using weighted sum where weights are proportional to Tf-Idf of each word.

import math
import numpy as np

std_embeddings_index = {}
with open('path/to/numberbatch-en-17.06.txt') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        embedding = np.asarray(values[1:], dtype='float32')
        std_embeddings_index[word] = embedding

def cosineValue(v1,v2):
    "compute cosine similarity of v1 to v2: (v1 dot v2)/{||v1||*||v2||)"
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)


def get_sentence_vector(sentence, std_embeddings_index = std_embeddings_index ):
    sent_vector = 0
    for word in sentence.lower().split():
        if word not in std_embeddings_index :
            word_vector = np.array(np.random.uniform(-1.0, 1.0, 300))
            std_embeddings_index[word] = word_vector
        else:
            word_vector = std_embeddings_index[word]
        sent_vector = sent_vector + word_vector

    return sent_vector

def cosine_sim(sent1, sent2):
    return cosineValue(get_sentence_vector(sent1), get_sentence_vector(sent2))

I did run for the given sentences and found the following results

s1 = "This is a foo bar sentence ."
s2 = "This sentence is similar to a foo bar sentence ."
s3 = "What is this string ? Totally not related to the other two lines ."

print cosine_sim(s1, s2) # Should give high cosine similarity
print cosine_sim(s1, s3) # Shouldn't give high cosine similarity value
print cosine_sim(s2, s3) # Shouldn't give high cosine similarity value

0.9851735249068168
0.6570885718962608
0.6589335425458225

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to string

How to split a string in two and store it in a field String method cannot be found in a main class method Kotlin - How to correctly concatenate a String Replacing a character from a certain index Remove quotes from String in Python Detect whether a Python string is a number or a letter How does String substring work in Swift How does String.Index work in Swift swift 3.0 Data to String? How to parse JSON string in Typescript

Examples related to nlp

Replace specific text with a redacted version using Python How to return history of validation loss in Keras How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn? Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis -NLP) Stopword removal with NLTK How to get rid of punctuation using NLTK tokenizer? Calculate cosine similarity given 2 sentence strings How do I tokenize a string sentence in NLTK? How to compute the similarity between two text documents? How do I do word Stemming or Lemmatization?

Examples related to similarity

What's the fastest way in Python to calculate cosine similarity given sparse matrix data? Find the similarity metric between two strings Calculate cosine similarity given 2 sentence strings Checking images for similarity with OpenCV

Examples related to cosine-similarity

Cosine Similarity between 2 Number Lists What's the fastest way in Python to calculate cosine similarity given sparse matrix data? Calculate cosine similarity given 2 sentence strings Can someone give an example of cosine similarity, in a very simple, graphical way?