I know its an old post. but I tried the http://scikit-learn.sourceforge.net/stable/ package. here is my code to find the cosine similarity. The question was how will you calculate the cosine similarity with this package and here is my code for that
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
f = open("/root/Myfolder/scoringDocuments/doc1")
doc1 = str.decode(f.read(), "UTF-8", "ignore")
f = open("/root/Myfolder/scoringDocuments/doc2")
doc2 = str.decode(f.read(), "UTF-8", "ignore")
f = open("/root/Myfolder/scoringDocuments/doc3")
doc3 = str.decode(f.read(), "UTF-8", "ignore")
train_set = ["president of India",doc1, doc2, doc3]
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set) #finds the tfidf score with normalization
print "cosine scores ==> ",cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train) #here the first element of tfidf_matrix_train is matched with other three elements
Here suppose the query is the first element of train_set and doc1,doc2 and doc3 are the documents which I want to rank with the help of cosine similarity. then I can use this code.
Also the tutorials provided in the question was very useful. Here are all the parts for it part-I,part-II,part-III
the output will be as follows :
[[ 1. 0.07102631 0.02731343 0.06348799]]
here 1 represents that query is matched with itself and the other three are the scores for matching the query with the respective documents.