[python] Python: tf-idf-cosine: to find document similarity

I know its an old post. but I tried the http://scikit-learn.sourceforge.net/stable/ package. here is my code to find the cosine similarity. The question was how will you calculate the cosine similarity with this package and here is my code for that

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

f = open("/root/Myfolder/scoringDocuments/doc1")
doc1 = str.decode(f.read(), "UTF-8", "ignore")
f = open("/root/Myfolder/scoringDocuments/doc2")
doc2 = str.decode(f.read(), "UTF-8", "ignore")
f = open("/root/Myfolder/scoringDocuments/doc3")
doc3 = str.decode(f.read(), "UTF-8", "ignore")

train_set = ["president of India",doc1, doc2, doc3]

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)  #finds the tfidf score with normalization
print "cosine scores ==> ",cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train)  #here the first element of tfidf_matrix_train is matched with other three elements

Here suppose the query is the first element of train_set and doc1,doc2 and doc3 are the documents which I want to rank with the help of cosine similarity. then I can use this code.

Also the tutorials provided in the question was very useful. Here are all the parts for it part-I,part-II,part-III

the output will be as follows :

[[ 1.          0.07102631  0.02731343  0.06348799]]

here 1 represents that query is matched with itself and the other three are the scores for matching the query with the respective documents.

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to machine-learning

Error in Python script "Expected 2D array, got 1D array instead:"? How to predict input image using trained model in Keras? What is the role of "Flatten" in Keras? How to concatenate two layers in keras? How to save final model using keras? scikit-learn random state in splitting dataset Why binary_crossentropy and categorical_crossentropy give different performances for the same problem? What is the meaning of the word logits in TensorFlow? Can anyone explain me StandardScaler? Can Keras with Tensorflow backend be forced to use CPU or GPU at will?

Examples related to nltk

re.sub erroring with "Expected string or bytes-like object" how to check which version of nltk, scikit learn installed? Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis -NLP) NLTK and Stopwords Fail #lookuperror Resource u'tokenizers/punkt/english.pickle' not found How do I download NLTK data? Stopword removal with NLTK n-grams in python, four, five, six grams? pip issue installing almost any library How to get rid of punctuation using NLTK tokenizer?

Examples related to information-retrieval

Python: tf-idf-cosine: to find document similarity

Examples related to tf-idf

Python: tf-idf-cosine: to find document similarity