Python: tf-idf-cosine: to find document similarity

Question

I was following a tutorial which was available at Part 1 & Part 2. Unfortunately the author didn't have the time for the final section, which involved using cosine similarity to actually find the distance between two documents. I followed the examples in the article with the help of the following link from stackoverflow. Included is the code mentioned in the above link (just so as to make life easier):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from nltk.corpus import stopwords
    import numpy as np
    import numpy.linalg as LA

    train_set = ["The sky is blue.", "The sun is bright."]  # Documents
    test_set = ["The sun in the sky is bright."]  # Query
    stopWords = stopwords.words('english')

    vectorizer = CountVectorizer(stop_words=stopWords)
    # print(vectorizer)
    transformer = TfidfTransformer()
    # print(transformer)

    # Term counts for the training documents and for the query
    trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
    testVectorizerArray = vectorizer.transform(test_set).toarray()
    print('Fit Vectorizer to train set', trainVectorizerArray)
    print('Transform Vectorizer to test set', testVectorizerArray)

    # tf-idf weights for the training documents
    transformer.fit(trainVectorizerArray)
    print()
    print(transformer.transform(trainVectorizerArray).toarray())

    # tf-idf weights for the query
    transformer.fit(testVectorizerArray)
    print()
    tfidf = transformer.transform(testVectorizerArray)
    print(tfidf.todense())

As a result of the above code I have the following matrix:

    Fit Vectorizer to train set [[1 0 1 0]
     [0 1 0 1]]
    Transform Vectorizer to test set [[0 1 1 1]]

    [[ 0.70710678  0.          0.70710678  0.        ]
     [ 0.          0.70710678  0.          0.70710678]]

    [[ 0.          0.57735027  0.57735027  0.57735027]]

I am not sure how to use this output in order to calculate cosine similarity. I know how to implement cosine similarity with respect to two vectors of similar length, but here I am not sure how to identify the two vectors.

User · Answer

Here is a function that compares your test data against the training data, with the Tf-Idf transformer fitted on the training data. The advantage is that you can quickly pivot or group by to find the n closest elements, and that the calculations are done matrix-wise.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def create_tokenizer_score(new_series, train_series, tokenizer):
        """
        Return the tf-idf score of each possible pair of documents.
        Args:
            new_series (pd.Series): new data (to compare against train data)
            train_series (pd.Series): train data (to fit the tf-idf transformer)
        Returns:
            pd.DataFrame
        """
        train_tfidf = tokenizer.fit_transform(train_series)
        new_tfidf = tokenizer.transform(new_series)
        # One row per new document, one column per training document
        X = pd.DataFrame(cosine_similarity(new_tfidf, train_tfidf), columns=train_series.index)
        X['ix_new'] = new_series.index
        # Reshape to long format: one row per (new, train) pair
        score = pd.melt(
            X,
            id_vars='ix_new',
            var_name='ix_train',
            value_name='score'
        )
        return score

    train_set = pd.Series(["The sky is blue.", "The sun is bright."])
    test_set = pd.Series(["The sun in the sky is bright."])
    tokenizer = TfidfVectorizer()  # initiate here your own tokenizer (TfidfVectorizer, CountVectorizer, with stopwords...)
    score = create_tokenizer_score(train_series=train_set, new_series=test_set, tokenizer=tokenizer)
    score

       ix_new   ix_train    score
    0   0       0       0.617034
    1   0       1       0.862012
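To illustrate the "pivot or group by" remark, here is a minimal sketch that picks the n closest training documents for each new document. The `score` frame below is hand-written with hypothetical values in the same long format the function returns; it is not output from the code above, and the column names are assumed to match.

```python
import pandas as pd

# Hypothetical scores in the long format (ix_new, ix_train, score).
score = pd.DataFrame({
    "ix_new":   [0, 0, 0, 1, 1, 1],
    "ix_train": [0, 1, 2, 0, 1, 2],
    "score":    [0.62, 0.86, 0.10, 0.05, 0.30, 0.91],
})

n = 2  # how many closest training documents to keep per new document

# Sort by score, keep the first n rows of each ix_new group,
# then order the result by new document and descending score.
top_n = (
    score.sort_values("score", ascending=False)
         .groupby("ix_new")
         .head(n)
         .sort_values(["ix_new", "score"], ascending=[True, False])
         .reset_index(drop=True)
)
print(top_n)

# Wide view: one row per new document, one column per training document.
wide = score.pivot(index="ix_new", columns="ix_train", values="score")
print(wide)
```

Sorting once and taking `head(n)` per group avoids a Python-level loop over the new documents.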

User · Answer

With the help of @excray's comment, I managed to figure out the answer. What we need to do is write a simple for loop to iterate over the two arrays that represent the train data and the test data.

First implement a simple lambda function to hold the formula for the cosine calculation:

    cosine_function = lambda a, b: round(np.inner(a, b) / (LA.norm(a) * LA.norm(b)), 3)

And then just write a simple for loop to iterate over the two sets of vectors. The logic is: "For each vector in trainVectorizerArray, you have to find the cosine similarity with the vector in testVectorizerArray."

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from nltk.corpus import stopwords
    import numpy as np
    import numpy.linalg as LA

    train_set = ["The sky is blue.", "The sun is bright."]  # Documents
    test_set = ["The sun in the sky is bright."]  # Query
    stopWords = stopwords.words('english')

    vectorizer = CountVectorizer(stop_words=stopWords)
    # print(vectorizer)
    transformer = TfidfTransformer()
    # print(transformer)

    trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
    testVectorizerArray = vectorizer.transform(test_set).toarray()
    print('Fit Vectorizer to train set', trainVectorizerArray)
    print('Transform Vectorizer to test set', testVectorizerArray)

    # Cosine similarity of two vectors, rounded to three decimals
    cx = lambda a, b: round(np.inner(a, b) / (LA.norm(a) * LA.norm(b)), 3)

    # Compare every training vector with every test vector
    for vector in trainVectorizerArray:
        print(vector)
        for testV in testVectorizerArray:
            print(testV)
            cosine = cx(vector, testV)
            print(cosine)

    transformer.fit(trainVectorizerArray)
    print()
    print(transformer.transform(trainVectorizerArray).toarray())

    transformer.fit(testVectorizerArray)
    print()
    tfidf = transformer.transform(testVectorizerArray)
    print(tfidf.todense())

Here is the output:

    Fit Vectorizer to train set [[1 0 1 0]
     [0 1 0 1]]
    Transform Vectorizer to test set [[0 1 1 1]]
    [1 0 1 0]
    [0 1 1 1]
    0.408
    [0 1 0 1]
    [0 1 1 1]
    0.816

    [[ 0.70710678  0.          0.70710678  0.        ]
     [ 0.          0.70710678  0.          0.70710678]]

    [[ 0.          0.57735027  0.57735027  0.57735027]]
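The nested loop in this answer can also be expressed as a single matrix product. The sketch below is one possible way to do that with NumPy only; the arrays are copied from the count vectors printed in the output, and the helper name `cosine_matrix` is made up for illustration.

```python
import numpy as np

# Count vectors taken from the printed output of the answer.
train = np.array([[1, 0, 1, 0],
                  [0, 1, 0, 1]], dtype=float)
test = np.array([[0, 1, 1, 1]], dtype=float)


def cosine_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    # Scale each row to unit length, then take all pairwise dot products.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


sims = cosine_matrix(train, test)
# One row per training document, one column per test document;
# the values are roughly 0.408 and 0.816, as in the loop output.
print(sims.round(3))
```

This assumes no row is all zeros; a document with no terms left after stop-word removal would cause a division by zero and would need separate handling.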

User · Answer

I know it's an old post, but I tried the http://scikit-learn.sourceforge.net/stable/ package. The question was how to calculate the cosine similarity with this package, and here is my code for that:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Read each document as UTF-8, ignoring bytes that cannot be decoded
    with open("/root/Myfolder/scoringDocuments/doc1", encoding="UTF-8", errors="ignore") as f:
        doc1 = f.read()
    with open("/root/Myfolder/scoringDocuments/doc2", encoding="UTF-8", errors="ignore") as f:
        doc2 = f.read()
    with open("/root/Myfolder/scoringDocuments/doc3", encoding="UTF-8", errors="ignore") as f:
        doc3 = f.read()

    train_set = ["president of India", doc1, doc2, doc3]

    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)  # finds the tfidf score with normalization
    # here the first element of tfidf_matrix_train is matched with all four elements, itself included
    print("cosine scores ==> ", cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train))

Here, suppose the query is the first element of train_set, and doc1, doc2 and doc3 are the documents which I want to rank with the help of cosine similarity. Then I can use this code.

Also, the tutorials provided in the question were very useful. Here are all the parts: part-I, part-II, part-III.

The output will be as follows:

    [[ 1.          0.07102631  0.02731343  0.06348799]]

Here 1 represents the query matched with itself, and the other three values are the scores for matching the query with the respective documents.
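Since the file paths in this answer are specific to the author's machine, here is a small self-contained variant of the same idea. The three document strings are hypothetical stand-ins for the contents of doc1, doc2 and doc3, so the scores it prints will not match the numbers quoted in the answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the query and the file contents.
query = "president of India"
docs = [
    "The president of India lives in New Delhi.",
    "The sun is bright today.",
    "India elected a new president.",
]

# Fit tf-idf on the query and the documents together; row 0 is the query.
tfidf = TfidfVectorizer().fit_transform([query] + docs)

# Similarity of the query to each document (the query itself is left out).
scores = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()

# Document indices ordered from most to least similar.
ranking = scores.argsort()[::-1]
for idx in ranking:
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```

Slicing with `tfidf[1:]` drops the trivial self-match, so every printed score refers to a real document.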

User · Answer

First off, if you want to extract count features and apply TF-IDF normalization and row-wise euclidean normalization, you can do it in one operation with TfidfVectorizer:

    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> from sklearn.datasets import fetch_20newsgroups
    >>> twenty = fetch_20newsgroups()

    >>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
    >>> tfidf
    <11314x130088 sparse matrix of type '<type 'numpy.float64'>'
        with 1787553 stored elements in Compressed Sparse Row format>

Now to find the cosine distances of one document (e.g. the first in the dataset) and all of the others, you just need to compute the dot products of the first vector with all of the others, as the tfidf vectors are already row-normalized.

As explained by Chris Clark in the comments and here, cosine similarity does not take into account the magnitude of the vectors. Row-normalised vectors have a magnitude of 1, and so the linear kernel is sufficient to calculate the similarity values.

The scipy sparse matrix API is a bit weird (not as flexible as dense N-dimensional numpy arrays). To get the first vector you need to slice the matrix row-wise to get a submatrix with a single row:

    >>> tfidf[0:1]
    <1x130088 sparse matrix of type '<type 'numpy.float64'>'
        with 89 stored elements in Compressed Sparse Row format>

scikit-learn already provides pairwise metrics (a.k.a. kernels in machine learning parlance) that work for both dense and sparse representations of vector collections. In this case we need a dot product, which is also known as the linear kernel:

    >>> from sklearn.metrics.pairwise import linear_kernel
    >>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
    >>> cosine_similarities
    array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
        0.04457106,  0.03293218])

Hence, to find the top related documents, we can use argsort and some negative array slicing (the most related documents have the highest cosine similarity values, hence they sit at the end of the sorted indices array). Note that the slice [:-5:-1] returns the four highest-scoring indices, not five:

    >>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
    >>> related_docs_indices
    array([    0,   958, 10576,  3277])
    >>> cosine_similarities[related_docs_indices]
    array([ 1.        ,  0.54967926,  0.32902194,  0.2825788 ])

The first result is a sanity check: we find the query document as the most similar document, with a cosine similarity score of 1. It has the following text:

    >>> print(twenty.data[0])
    From: lerxst@wam.umd.edu (where's my thing)
    Subject: WHAT car is this!?
    Nntp-Posting-Host: rac3.wam.umd.edu
    Organization: University of Maryland, College Park
    Lines: 15

     I was wondering if anyone out there could enlighten me on this car I saw
    the other day. It was a 2-door sports car, looked to be from the late 60s/
    early 70s. It was called a Bricklin. The doors were really small. In addition,
    the front bumper was separate from the rest of the body. This is
    all I know. If anyone can tellme a model name, engine specs, years
    of production, where this car is made, history, or whatever info you
    have on this funky looking car, please e-mail.

    Thanks,
    - IL
       ---- brought to you by your neighborhood Lerxst ----

The second most similar document is a reply that quotes the original message, hence it has many words in common:

    >>> print(twenty.data[958])
    From: rseymour@reed.edu (Robert Seymour)
    Subject: Re: WHAT car is this!?
    Article-I.D.: reed.1993Apr21.032905.29286
    Reply-To: rseymour@reed.edu
    Organization: Reed College, Portland, OR
    Lines: 26

    In article <1993Apr20.174246.14375@wam.umd.edu> lerxst@wam.umd.edu (where's my thing) writes:
    >
    >  I was wondering if anyone out there could enlighten me on this car I saw
    > the other day. It was a 2-door sports car, looked to be from the late 60s/
    > early 70s. It was called a Bricklin. The doors were really small. In addition,
    > the front bumper was separate from the rest of the body. This is
    > all I know. If anyone can tellme a model name, engine specs, years
    > of production, where this car is made, history, or whatever info you
    > have on this funky looking car, please e-mail.

    Bricklins were manufactured in the 70s with engines from Ford. They are rather
    odd looking with the encased front bumper. There aren't a lot of them around,
    but Hemmings (Motor News) ususally has ten or so listed. Basically, they are a
    performance Ford with new styling slapped on top.

    >    ---- brought to you by your neighborhood Lerxst ----

    Rush fan?

    --
    Robert Seymour              rseymour@reed.edu
    Physics and Philosophy, Reed College    (NeXTmail accepted)
    Artificial Life Project         Reed College
    Reed Solar Energy Project (SolTrain)    Portland, OR
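Two points from this answer can be checked on a toy example: that the dot product of unit-length rows equals their cosine similarity, and how argsort slicing selects the top matches. The matrix below is hypothetical (it is not 20 newsgroups data), and plain NumPy stands in for the sparse tf-idf matrix and `linear_kernel`.

```python
import numpy as np

# Hypothetical toy matrix: 4 "documents" described by 3 term weights.
X = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 3.0, 0.0],
    [0.0, 1.0, 3.0],
    [5.0, 0.0, 1.0],
])

# Row-wise L2 normalization, which TfidfVectorizer applies by default.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Dot product of unit rows with the first row (what a linear kernel computes).
dot_scores = X_unit @ X_unit[0]

# Cosine similarity computed from the unnormalized rows, for comparison.
cos_scores = (X @ X[0]) / (np.linalg.norm(X, axis=1) * np.linalg.norm(X[0]))

# Indices of the k most similar rows, best first.
k = 3
top_k = np.argsort(dot_scores)[::-1][:k]

# The slice [:-k:-1] stops one element early and yields only k - 1 indices.
short_slice = np.argsort(dot_scores)[:-k:-1]

print(top_k, short_slice)
```

Reversing the sorted indices first and then taking `[:k]` makes the number of results explicit, which is easy to get wrong with a single negative-step slice.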

User · Answer

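This should help you.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # train_set and length are not defined in this snippet; presumably train_set is
    # the list of document strings with the query last, and length = len(train_set)
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(train_set)
    print(tfidf_matrix)
    # Compare the last document with every document, itself included
    cosine = cosine_similarity(tfidf_matrix[length-1], tfidf_matrix)
    print(cosine)

and the output will be:

    [[ 0.34949812  0.81649658  1.        ]]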

User · Answer

Let me give you another tutorial written by me. It answers your question, but also explains why we are doing some of the things. I also tried to make it concise.

So you have a list_of_documents, which is just an array of strings, and another document, which is just a string. You need to find the document from list_of_documents that is the most similar to document.

Let's combine them together:

    documents = list_of_documents + [document]

Let's start with dependencies. It will become clear why we use each of them.

    from nltk.corpus import stopwords
    import string
    from nltk.tokenize import wordpunct_tokenize as tokenize
    from nltk.stem.porter import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from scipy.spatial.distance import cosine

One of the approaches that can be used is a bag-of-words approach, where we treat each word in the document independently of the others and just throw all of them together in a big bag. From one point of view, it loses a lot of information (like how the words are connected), but from another point of view it makes the model simple.

In English, and in any other human language, there are a lot of "useless" words like 'a', 'the', 'in', which are so common that they do not carry much meaning. They are called stop words and it is a good idea to remove them. Another thing one can notice is that words like 'analyze', 'analyzer', 'analysis' are really similar. They have a common root and can all be converted to just one word. This process is called stemming, and there exist different stemmers which differ in speed, aggressiveness and so on. So we transform each of the documents into a list of word stems without stop words. We also discard all the punctuation.

    porter = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    # Translation table that deletes every punctuation character
    punct_table = str.maketrans('', '', string.punctuation)
    modified_arr = [[porter.stem(i.lower()) for i in tokenize(d.translate(punct_table)) if i.lower() not in stop_words] for d in documents]

So how will this bag of words help us? Imagine we have 3 bags: [a, b, c], [a, c, a] and [b, c, d]. We can convert them to vectors in the basis [a, b, c, d]. So we end up with vectors: [1, 1, 1, 0], [2, 0, 1, 0] and [0, 1, 1, 1]. The same applies to our documents, only the vectors will be far longer. Now we see why we removed a lot of words and stemmed the others: to decrease the dimensions of the vectors. Here is one more interesting observation. Longer documents will have far more positive elements than shorter ones, which is why it is nice to normalize the vector. This is called term frequency (TF). People also use additional information about how often the word is used in other documents: inverse document frequency (IDF). Together we have a metric, TF-IDF, which has a couple of flavors. This can be achieved with one line in sklearn:

    modified_doc = [' '.join(i) for i in modified_arr]  # this is only to convert our list of lists to the list of strings that the vectorizer uses
    tf_idf = TfidfVectorizer().fit_transform(modified_doc)

Actually, the vectorizer allows you to do a lot of things, like removing stop words and lowercasing. I have done them in a separate step only because sklearn does not have non-English stop words, but nltk has.

So we have all the vectors calculated. The last step is to find which one is the most similar to the last one. There are various ways to achieve that. One of them is Euclidean distance, which is not so great for the reason discussed here. Another approach is cosine similarity. We iterate over all the documents and calculate the cosine distance between each document and the last one:

    l = len(documents) - 1
    # The last row is the document we are comparing against
    target = tf_idf[l].toarray().ravel()
    minimum = (1, None)  # (smallest cosine distance so far, index of that document)
    for i in range(l):
        distance = cosine(tf_idf[i].toarray().ravel(), target)
        if distance < minimum[0]:
            minimum = (distance, i)
    print(minimum)

Now minimum will hold the index of the best document and its score. The score here is a cosine distance (1 minus the cosine similarity), so lower means more similar.
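The three-bag example in this answer can be reproduced in a few lines of standard-library Python. This is only a sketch of the counting step; the helper name `bags_to_vectors` is made up for illustration, and real documents would go through the tokenizing and stemming shown earlier first.

```python
from collections import Counter


def bags_to_vectors(bags):
    """Map each bag (a list of tokens) to a count vector over the sorted vocabulary."""
    # The basis is every distinct token, in a fixed (sorted) order.
    basis = sorted({token for bag in bags for token in bag})
    vectors = []
    for bag in bags:
        counts = Counter(bag)
        # Counter returns 0 for tokens that do not occur in this bag.
        vectors.append([counts[token] for token in basis])
    return basis, vectors


bags = [["a", "b", "c"], ["a", "c", "a"], ["b", "c", "d"]]
basis, vectors = bags_to_vectors(bags)
print(basis)    # ['a', 'b', 'c', 'd']
print(vectors)  # [[1, 1, 1, 0], [2, 0, 1, 0], [0, 1, 1, 1]]
```

Each position in a vector corresponds to one token of the basis, which is why every extra distinct word adds a dimension and why removing stop words and stemming keeps the vectors shorter.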
