[python] What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times.

Say the input matrix is:

A= 
[0 1 0 0 1
 0 0 1 1 1
 1 1 0 1 0]

The sparse representation is:

A = 
0, 1
0, 4
1, 2
1, 3
1, 4
2, 0
2, 1
2, 3

In Python, it's straightforward to work with the matrix-input format:

import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine

A = np.array(
[[0, 1, 0, 0, 1],
[0, 0, 1, 1, 1],
[1, 1, 0, 1, 0]])

dist_out = 1-pairwise_distances(A, metric="cosine")
dist_out

Gives:

array([[ 1.        ,  0.40824829,  0.40824829],
       [ 0.40824829,  1.        ,  0.33333333],
       [ 0.40824829,  0.33333333,  1.        ]])

That's fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished? Thanks in advance.

This question is related to python numpy pandas similarity cosine-similarity

The answer is


I took all these answers and wrote a script to 1. validate each of the results (see assertion below) and 2. see which is the fastest. Code and results are below:

# Imports
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import squareform, pdist
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

# Create an adjacency matrix
np.random.seed(42)
A = np.random.randint(0, 2, (10000, 100)).astype(float).T

# Make it sparse
rows, cols = np.where(A)
data = np.ones(len(rows))
Asp = sp.csr_matrix((data, (rows, cols)), shape = (rows.max()+1, cols.max()+1))

print "Input data shape:", Asp.shape

# Define a function to calculate the cosine similarities a few different ways
def calc_sim(A, method=1):
    if method == 1:
        return 1 - squareform(pdist(A, metric='cosine'))
    if method == 2:
        Anorm = A / np.linalg.norm(A, axis=-1)[:, np.newaxis]
        return np.dot(Anorm, Anorm.T)
    if method == 3:
        Anorm = A / np.linalg.norm(A, axis=-1)[:, np.newaxis]
        return linear_kernel(Anorm)
    if method == 4:
        similarity = np.dot(A, A.T)

        # squared magnitude of preference vectors (number of occurrences)
        square_mag = np.diag(similarity)

        # inverse squared magnitude
        inv_square_mag = 1 / square_mag

        # if it doesn't occur, set it's inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0

        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag)

        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = similarity * inv_mag
        return cosine.T * inv_mag
    if method == 5:
        '''
        Just a version of method 4 that takes in sparse arrays
        '''
        similarity = A*A.T
        square_mag = np.array(A.sum(axis=1))
        # inverse squared magnitude
        inv_square_mag = 1 / square_mag

        # if it doesn't occur, set it's inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0

        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag).T

        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = np.array(similarity.multiply(inv_mag))
        return cosine * inv_mag.T
    if method == 6:
        return cosine_similarity(A)

# Assert that all results are consistent with the first model ("truth")
for m in range(1, 7):
    if m in [5]: # The sparse case
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(Asp, method=m))
    else:
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(A, method=m))

# Time them:
print "Method 1"
%timeit calc_sim(A, method=1)
print "Method 2"
%timeit calc_sim(A, method=2)
print "Method 3"
%timeit calc_sim(A, method=3)
print "Method 4"
%timeit calc_sim(A, method=4)
print "Method 5"
%timeit calc_sim(Asp, method=5)
print "Method 6"
%timeit calc_sim(A, method=6)

Results:

Input data shape: (100, 10000)
Method 1
10 loops, best of 3: 71.3 ms per loop
Method 2
100 loops, best of 3: 8.2 ms per loop
Method 3
100 loops, best of 3: 8.6 ms per loop
Method 4
100 loops, best of 3: 2.54 ms per loop
Method 5
10 loops, best of 3: 73.7 ms per loop
Method 6
10 loops, best of 3: 77.3 ms per loop

Hi you can do it this way

    temp = sp.coo_matrix((data, (row, col)), shape=(3, 59))
    temp1 = temp.tocsr()

    #Cosine similarity
    row_sums = ((temp1.multiply(temp1)).sum(axis=1))
    rows_sums_sqrt = np.array(np.sqrt(row_sums))[:,0]
    row_indices, col_indices = temp1.nonzero()
    temp1.data /= rows_sums_sqrt[row_indices]
    temp2 = temp1.transpose()
    temp3 = temp1*temp2

The following method is about 30 times faster than scipy.spatial.distance.pdist. It works pretty quickly on large matrices (assuming you have enough RAM)

See below for a discussion of how to optimize for sparsity.

# base similarity matrix (all dot products)
# replace this with A.dot(A.T).toarray() for sparse representation
similarity = numpy.dot(A, A.T)


# squared magnitude of preference vectors (number of occurrences)
square_mag = numpy.diag(similarity)

# inverse squared magnitude
inv_square_mag = 1 / square_mag

# if it doesn't occur, set it's inverse magnitude to zero (instead of inf)
inv_square_mag[numpy.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = numpy.sqrt(inv_square_mag)

# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag

If your problem is typical for large scale binary preference problems, you have a lot more entries in one dimension than the other. Also, the short dimension is the one whose entries you want to calculate similarities between. Let's call this dimension the 'item' dimension.

If this is the case, list your 'items' in rows and create A using scipy.sparse. Then replace the first line as indicated.

If your problem is atypical you'll need more modifications. Those should be pretty straightforward replacements of basic numpy operations with their scipy.sparse equivalents.


You should check out scipy.sparse. You can apply operations on those sparse matrices just like how you use a normal matrix.


Building off of Vaali's solution:

def sparse_cosine_similarity(sparse_matrix):
    out = (sparse_matrix.copy() if type(sparse_matrix) is csr_matrix else
           sparse_matrix.tocsr())
    squared = out.multiply(out)
    sqrt_sum_squared_rows = np.array(np.sqrt(squared.sum(axis=1)))[:, 0]
    row_indices, col_indices = out.nonzero()
    out.data /= sqrt_sum_squared_rows[row_indices]
    return out.dot(out.T)

This takes a sparse matrix (preferably a csr_matrix) and returns a csr_matrix. It should do the more intensive parts using sparse calculations with pretty minimal memory overhead. I haven't tested it extensively though, so caveat emptor (Update: I feel confident in this solution now that I've tested and benchmarked it)

Also, here is the sparse version of Waylon's solution in case it helps anyone, not sure which solution is actually better.

def sparse_cosine_similarity_b(sparse_matrix):
    input_csr_matrix = sparse_matrix.tocsr()
    similarity = input_csr_matrix * input_csr_matrix.T
    square_mag = similarity.diagonal()
    inv_square_mag = 1 / square_mag
    inv_square_mag[np.isinf(inv_square_mag)] = 0
    inv_mag = np.sqrt(inv_square_mag)
    return similarity.multiply(inv_mag).T.multiply(inv_mag)

Both solutions seem to have parity with sklearn.metrics.pairwise.cosine_similarity

:-D

Update:

Now I have tested both solutions against my existing Cython implementation: https://github.com/davidmashburn/sparse_dot/blob/master/test/benchmarks_v3_output_table.txt and it looks like the first algorithm performs the best of the three most of the time.


I suggest to run in two steps:

1) generate mapping A that maps A:column index->non zero objects

2) for each object i (row) with non-zero occurrences(columns) {k1,..kn} calculate cosine similarity just for elements in the union set A[k1] U A[k2] U.. A[kn]

Assuming a big sparse matrix with high sparsity this will gain a significant boost over brute force


def norm(vector):
    return sqrt(sum(x * x for x in vector))    

def cosine_similarity(vec_a, vec_b):
        norm_a = norm(vec_a)
        norm_b = norm(vec_b)
        dot = sum(a * b for a, b in zip(vec_a, vec_b))
        return dot / (norm_a * norm_b)

This method seems to be somewhat faster than using sklearn's implementation if you pass in one pair of vectors at a time.


I have tried some methods above. However, the experiment by @zbinsd has its limitation. The sparsity of matrix used in the experiment is extremely low while the real sparsity is usually over 90%. In my condition, the sparse is with the shape of (7000, 25000) and the sparsity of 97%. The method 4 is extremely slow and I can't tolerant getting the results. I use the method 6 which is finished in 10 s. Amazingly, I try the method below and it's finished in only 0.247 s.

import sklearn.preprocessing as pp

def cosine_similarities(mat):
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
    return col_normed_mat.T * col_normed_mat

This efficient method is linked by enter link description here


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to numpy

Unable to allocate array with shape and data type How to fix 'Object arrays cannot be loaded when allow_pickle=False' for imdb.load_data() function? Numpy, multiply array with scalar TypeError: only integer scalar arrays can be converted to a scalar index with 1D numpy indices array Could not install packages due to a "Environment error :[error 13]: permission denied : 'usr/local/bin/f2py'" Pytorch tensor to numpy array Numpy Resize/Rescale Image what does numpy ndarray shape do? How to round a numpy array? numpy array TypeError: only integer scalar arrays can be converted to a scalar index

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to similarity

What's the fastest way in Python to calculate cosine similarity given sparse matrix data? Find the similarity metric between two strings Calculate cosine similarity given 2 sentence strings Checking images for similarity with OpenCV

Examples related to cosine-similarity

Cosine Similarity between 2 Number Lists What's the fastest way in Python to calculate cosine similarity given sparse matrix data? Calculate cosine similarity given 2 sentence strings Can someone give an example of cosine similarity, in a very simple, graphical way?