What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

Question

Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times.

Say the input matrix is:

    A =
    [0 1 0 0 1
     0 0 1 1 1
     1 1 0 1 0]

The sparse representation is:

    A =
    0, 1
    0, 4
    1, 2
    1, 3
    1, 4
    2, 0
    2, 1
    2, 3

In Python, it's straightforward to work with the matrix-input format:

    import numpy as np
    from sklearn.metrics import pairwise_distances
    from scipy.spatial.distance import cosine

    A = np.array(
    [[0, 1, 0, 0, 1],
     [0, 0, 1, 1, 1],
     [1, 1, 0, 1, 0]])

    dist_out = 1 - pairwise_distances(A, metric="cosine")
    dist_out

Gives:

    array([[ 1.        ,  0.40824829,  0.40824829],
           [ 0.40824829,  1.        ,  0.33333333],
           [ 0.40824829,  0.33333333,  1.        ]])

That's fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished? Thanks in advance.

User · Answer

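I suggest doing this in two steps:

1) Generate a mapping A' from each column index to the objects that are non-zero in that column (A': column index -> non-zero objects).

2) For each object i (row) with non-zero occurrences in columns {k1, ..., kn}, calculate the cosine similarity only against the elements in the union set A'[k1] U A'[k2] U ... U A'[kn].

Assuming a big matrix with high sparsity, this gives a significant boost over brute force, because pairs of rows that share no column are never compared.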
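The two steps above could be sketched roughly as follows. This is a minimal pure-Python illustration, not the answerer's code: the function name `candidate_cosine_similarities` is hypothetical, the input is assumed to be the question's list of (row, col) pairs, and the data is assumed to be binary, so a dot product is just the number of shared columns.

```python
from collections import defaultdict
from math import sqrt

def candidate_cosine_similarities(pairs):
    """Row-to-row cosine similarities for binary data given as (row, col) pairs.

    Only row pairs that share at least one column are ever visited.
    """
    # Step 1: inverted index, column -> set of rows that are non-zero there.
    col_to_rows = defaultdict(set)
    row_to_cols = defaultdict(set)
    for r, c in pairs:
        col_to_rows[c].add(r)
        row_to_cols[r].add(c)

    sims = {}
    for i, cols in row_to_cols.items():
        # Step 2: candidates are the union of the index entries for row i's columns.
        candidates = set().union(*(col_to_rows[c] for c in cols))
        for j in candidates:
            if j <= i:
                continue  # visit each unordered pair once and skip the row itself
            shared = len(cols & row_to_cols[j])  # dot product for 0/1 data
            sims[(i, j)] = shared / sqrt(len(cols) * len(row_to_cols[j]))
    return sims

# The sparse listing from the question.
pairs = [(0, 1), (0, 4), (1, 2), (1, 3), (1, 4), (2, 0), (2, 1), (2, 3)]
sims = candidate_cosine_similarities(pairs)
```

Row pairs that never appear in `sims` have similarity zero. On a tiny, fairly dense example like this one every pair is a candidate, so the saving only shows up on genuinely sparse data.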

User · Answer

I have tried some of the methods above. However, the experiment by @zbinsd has a limitation: the sparsity of the matrix used in that experiment is extremely low, while real-world sparsity is usually over 90%. In my case, the sparse matrix has shape (7000, 25000) and a sparsity of 97%. Method 4 is extremely slow and I could not wait for it to finish. Method 6 finished in 10 s. Amazingly, the method below finished in only 0.247 s:

    import sklearn.preprocessing as pp

    def cosine_similarities(mat):
        # L2-normalize each column, then take all column-by-column dot products
        col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
        return col_normed_mat.T * col_normed_mat

This efficient method is taken from a linked source.
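To try this kind of comparison on data closer to the sparsity described, one could generate a random test matrix. The following is a minimal sketch under several assumptions: the test data comes from `scipy.sparse.random`, the shape is much smaller than (7000, 25000) to keep it quick, the helper name `sparsity` is hypothetical, and the column normalization is written out by hand with SciPy instead of calling `sklearn.preprocessing.normalize`.

```python
import numpy as np
import scipy.sparse as sp

def sparsity(mat):
    """Fraction of entries that are not stored in a sparse matrix."""
    n_rows, n_cols = mat.shape
    return 1.0 - mat.nnz / (n_rows * n_cols)

# Assumed test data: 3% density, i.e. roughly 97% sparsity.
mat = sp.random(700, 250, density=0.03, format="csc", random_state=0)

# Column norms; guard against all-zero columns before dividing.
col_norms = np.sqrt(np.asarray(mat.multiply(mat).sum(axis=0)).ravel())
col_norms[col_norms == 0] = 1.0

# Scale each column by 1/norm with a diagonal matrix, then take dot products.
col_normed = mat @ sp.diags(1.0 / col_norms)
similarities = col_normed.T @ col_normed  # sparse (250, 250) column similarities
```

Timing `similarities` against the other methods on such a matrix should give a fairer picture than a half-full 0/1 matrix, although the absolute numbers will depend on the machine and library versions.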

User · Answer

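Hi, you can do it this way:

    import numpy as np
    import scipy.sparse as sp

    # data, row, col: values and coordinates of the non-zero entries (float data)
    temp = sp.coo_matrix((data, (row, col)), shape=(3, 59))
    temp1 = temp.tocsr()

    # Cosine similarity
    row_sums = ((temp1.multiply(temp1)).sum(axis=1))
    rows_sums_sqrt = np.array(np.sqrt(row_sums))[:, 0]
    row_indices, col_indices = temp1.nonzero()
    # divide every stored value by the norm of its row
    temp1.data /= rows_sums_sqrt[row_indices]
    temp2 = temp1.transpose()
    temp3 = temp1 * temp2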

User · Answer

You should check out scipy.sparse. You can apply operations to those sparse matrices just as you would with a normal matrix.
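As a concrete illustration, the coordinate listing from the question can be loaded directly into a `scipy.sparse` matrix. This is a small sketch, assuming every listed entry has the value 1 and that the matrix shape (3, 5) is known in advance.

```python
import numpy as np
from scipy import sparse

# (row, col) coordinates of the non-zero entries, as listed in the question.
rows = np.array([0, 0, 1, 1, 1, 2, 2, 2])
cols = np.array([1, 4, 2, 3, 4, 0, 1, 3])
data = np.ones(len(rows))

A = sparse.csr_matrix((data, (rows, cols)), shape=(3, 5))

# Ordinary matrix operations work on the sparse matrix itself.
gram = A @ A.T                        # row-by-row dot products, still sparse
row_norms = np.sqrt(gram.diagonal())  # diagonal holds the squared row norms
cosine = gram.toarray() / np.outer(row_norms, row_norms)
```

Only the small 3 x 3 result is converted to a dense array here; for a large number of rows you would keep the result sparse instead.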

User · Answer

The following method is about 30 times faster than scipy.spatial.distance.pdist. It works pretty quickly on large matrices (assuming you have enough RAM).

See below for a discussion of how to optimize for sparsity.

    import numpy

    # base similarity matrix (all dot products)
    # replace this with A.dot(A.T).toarray() for sparse representation
    similarity = numpy.dot(A, A.T)

    # squared magnitude of preference vectors (number of occurrences)
    square_mag = numpy.diag(similarity)

    # inverse squared magnitude
    inv_square_mag = 1 / square_mag

    # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
    inv_square_mag[numpy.isinf(inv_square_mag)] = 0

    # inverse of the magnitude
    inv_mag = numpy.sqrt(inv_square_mag)

    # cosine similarity (elementwise multiply by inverse magnitudes)
    cosine = similarity * inv_mag
    cosine = cosine.T * inv_mag

If your problem is typical for large-scale binary preference problems, you have a lot more entries in one dimension than the other. Also, the short dimension is the one whose entries you want to calculate similarities between. Let's call this dimension the 'item' dimension.

If this is the case, list your 'items' in rows and create A using scipy.sparse. Then replace the first line as indicated.

If your problem is atypical, you'll need more modifications. Those should be pretty straightforward replacements of basic numpy operations with their scipy.sparse equivalents.
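In matrix terms, the code above relies on the following identity, written for a matrix $A$ whose rows $a_i$ are the item vectors. The symbols $S$ and $d$ are notation introduced here for illustration, not names from the answer.

```latex
\begin{aligned}
S &= A A^{\top}, \qquad S_{ij} = a_i \cdot a_j, \qquad S_{ii} = \lVert a_i \rVert^2 \\
d_i &= \begin{cases} 1/\sqrt{S_{ii}} & \text{if } S_{ii} > 0 \\ 0 & \text{if } S_{ii} = 0 \end{cases} \\
\cos(a_i, a_j) &= d_i \, S_{ij} \, d_j = \frac{a_i \cdot a_j}{\lVert a_i \rVert \, \lVert a_j \rVert} \quad \text{(for non-zero } a_i, a_j\text{)}
\end{aligned}
```

The two elementwise multiplications with a transpose in between apply the factor $d_j$ along one axis and $d_i$ along the other. For binary data, $S_{ii}$ is simply the number of non-zero entries in row $i$.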

User · Answer

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    from scipy import sparse

    A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
    A_sparse = sparse.csr_matrix(A)

    similarities = cosine_similarity(A_sparse)
    print('pairwise dense output:\n {}\n'.format(similarities))

    # also can output sparse matrices
    similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
    print('pairwise sparse output:\n {}\n'.format(similarities_sparse))

Results:

    pairwise dense output:
    [[ 1.          0.40824829  0.40824829]
     [ 0.40824829  1.          0.33333333]
     [ 0.40824829  0.33333333  1.        ]]

    pairwise sparse output:
      (0, 1)    0.408248290464
      (0, 2)    0.408248290464
      (0, 0)    1.0
      (1, 0)    0.408248290464
      (1, 2)    0.333333333333
      (1, 1)    1.0
      (2, 1)    0.333333333333
      (2, 0)    0.408248290464
      (2, 2)    1.0

If you want column-wise cosine similarities, simply transpose your input matrix beforehand:

    A_sparse.transpose()

User · Answer

    from math import sqrt

    def norm(vector):
        # Euclidean length of the vector
        return sqrt(sum(x * x for x in vector))

    def cosine_similarity(vec_a, vec_b):
        norm_a = norm(vec_a)
        norm_b = norm(vec_b)
        dot = sum(a * b for a, b in zip(vec_a, vec_b))
        return dot / (norm_a * norm_b)

This method seems to be somewhat faster than using sklearn's implementation if you pass in one pair of vectors at a time.
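Since the question starts from a sparse representation, the same pairwise idea can be applied to vectors that store only their non-zero entries. The following is a sketch, not part of the answer above: the function name `sparse_cosine` and the `{index: value}` dict layout are assumptions made for illustration, as is the choice to return 0.0 for an all-zero vector.

```python
from math import sqrt

def sparse_cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the shorter dict; indices missing from the other contribute 0.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    dot = sum(value * vec_b.get(index, 0.0) for index, value in vec_a.items())
    norm_a = sqrt(sum(v * v for v in vec_a.values()))
    norm_b = sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Rows 0 and 1 of the question's matrix, keeping only the non-zero columns.
row0 = {1: 1.0, 4: 1.0}
row1 = {2: 1.0, 3: 1.0, 4: 1.0}
similarity = sparse_cosine(row0, row1)  # 1 / sqrt(2 * 3)
```

The work per pair is proportional to the number of stored entries, not the full vector length, but calling this for every pair is still n-choose-two calls, which the question wanted to avoid.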

User · Answer

I took all these answers and wrote a script to 1) validate each of the results (see assertion below) and 2) see which is the fastest. Code and results are below:

    # Imports
    import numpy as np
    import scipy.sparse as sp
    from scipy.spatial.distance import squareform, pdist
    from sklearn.metrics.pairwise import linear_kernel
    from sklearn.preprocessing import normalize
    from sklearn.metrics.pairwise import cosine_similarity

    # Create an adjacency matrix
    np.random.seed(42)
    A = np.random.randint(0, 2, (10000, 100)).astype(float).T

    # Make it sparse
    rows, cols = np.where(A)
    data = np.ones(len(rows))
    Asp = sp.csr_matrix((data, (rows, cols)), shape=(rows.max()+1, cols.max()+1))

    print("Input data shape:", Asp.shape)

    # Define a function to calculate the cosine similarities a few different ways
    def calc_sim(A, method=1):
        if method == 1:
            return 1 - squareform(pdist(A, metric='cosine'))
        if method == 2:
            Anorm = A / np.linalg.norm(A, axis=-1)[:, np.newaxis]
            return np.dot(Anorm, Anorm.T)
        if method == 3:
            Anorm = A / np.linalg.norm(A, axis=-1)[:, np.newaxis]
            return linear_kernel(Anorm)
        if method == 4:
            similarity = np.dot(A, A.T)

            # squared magnitude of preference vectors (number of occurrences)
            square_mag = np.diag(similarity)

            # inverse squared magnitude
            inv_square_mag = 1 / square_mag

            # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
            inv_square_mag[np.isinf(inv_square_mag)] = 0

            # inverse of the magnitude
            inv_mag = np.sqrt(inv_square_mag)

            # cosine similarity (elementwise multiply by inverse magnitudes)
            cosine = similarity * inv_mag
            return cosine.T * inv_mag
        if method == 5:
            '''
            Just a version of method 4 that takes in sparse arrays
            '''
            similarity = A*A.T
            # row sums equal squared magnitudes only because the data is 0/1
            square_mag = np.array(A.sum(axis=1))
            # inverse squared magnitude
            inv_square_mag = 1 / square_mag

            # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
            inv_square_mag[np.isinf(inv_square_mag)] = 0

            # inverse of the magnitude
            inv_mag = np.sqrt(inv_square_mag).T

            # cosine similarity (elementwise multiply by inverse magnitudes)
            cosine = np.array(similarity.multiply(inv_mag))
            return cosine * inv_mag.T
        if method == 6:
            return cosine_similarity(A)

    # Assert that all results are consistent with the first model ("truth")
    for m in range(1, 7):
        if m in [5]:  # The sparse case
            np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(Asp, method=m))
        else:
            np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(A, method=m))

    # Time them (%timeit is an IPython magic):
    print("Method 1")
    %timeit calc_sim(A, method=1)
    print("Method 2")
    %timeit calc_sim(A, method=2)
    print("Method 3")
    %timeit calc_sim(A, method=3)
    print("Method 4")
    %timeit calc_sim(A, method=4)
    print("Method 5")
    %timeit calc_sim(Asp, method=5)
    print("Method 6")
    %timeit calc_sim(A, method=6)

Results:

    Input data shape: (100, 10000)
    Method 1
    10 loops, best of 3: 71.3 ms per loop
    Method 2
    100 loops, best of 3: 8.2 ms per loop
    Method 3
    100 loops, best of 3: 8.6 ms per loop
    Method 4
    100 loops, best of 3: 2.54 ms per loop
    Method 5
    10 loops, best of 3: 73.7 ms per loop
    Method 6
    10 loops, best of 3: 77.3 ms per loop

User · Answer

Building off of Vaali's solution:

    import numpy as np
    from scipy.sparse import csr_matrix

    def sparse_cosine_similarity(sparse_matrix):
        out = (sparse_matrix.copy() if type(sparse_matrix) is csr_matrix else
               sparse_matrix.tocsr())
        squared = out.multiply(out)
        sqrt_sum_squared_rows = np.array(np.sqrt(squared.sum(axis=1)))[:, 0]
        row_indices, col_indices = out.nonzero()
        # divide every stored value by the norm of its row
        out.data /= sqrt_sum_squared_rows[row_indices]
        return out.dot(out.T)

This takes a sparse matrix (preferably a csr_matrix) and returns a csr_matrix. It should do the more intensive parts using sparse calculations with pretty minimal memory overhead. I haven't tested it extensively though, so caveat emptor. (Update: I feel confident in this solution now that I've tested and benchmarked it.)

Also, here is the sparse version of Waylon's solution in case it helps anyone; I'm not sure which solution is actually better.

    def sparse_cosine_similarity_b(sparse_matrix):
        input_csr_matrix = sparse_matrix.tocsr()
        similarity = input_csr_matrix * input_csr_matrix.T
        square_mag = similarity.diagonal()
        inv_square_mag = 1 / square_mag
        inv_square_mag[np.isinf(inv_square_mag)] = 0
        inv_mag = np.sqrt(inv_square_mag)
        return similarity.multiply(inv_mag).T.multiply(inv_mag)

Both solutions seem to have parity with sklearn.metrics.pairwise.cosine_similarity. :-D

Update:

Now I have tested both solutions against my existing Cython implementation: https://github.com/davidmashburn/sparse_dot/blob/master/test/benchmarks_v3_output_table.txt and it looks like the first algorithm performs the best of the three most of the time.
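One detail worth checking in the first function: the in-place division pairs each stored value with a row index from `nonzero()`, and `nonzero()` skips explicitly stored zeros, so the two arrays only line up when the matrix has none. It also needs float data for the in-place division. A variant sketch that sidesteps both by using the CSR `indptr` array is shown below; the name `sparse_cosine_similarity_indptr` is hypothetical and this is not the answerer's benchmarked code.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_cosine_similarity_indptr(sparse_matrix):
    """Row-wise cosine similarity of a sparse matrix; returns a sparse matrix."""
    # Copy into float CSR so the in-place division below is always valid.
    out = csr_matrix(sparse_matrix, dtype=float, copy=True)
    row_norms = np.sqrt(np.asarray(out.multiply(out).sum(axis=1)).ravel())
    row_norms[row_norms == 0] = 1.0  # leave all-zero rows untouched
    # np.diff(indptr) is the number of stored entries per row, so this
    # expands the norms to line up one-to-one with out.data.
    out.data /= np.repeat(row_norms, np.diff(out.indptr))
    return out @ out.T

A = csr_matrix(np.array([[0, 1, 0, 0, 1],
                         [0, 0, 1, 1, 1],
                         [1, 1, 0, 1, 0]]))
result = sparse_cosine_similarity_indptr(A).toarray()
```

On the question's example this should agree with the dense output shown in the question. Whether it is faster or slower than the `nonzero()` version has not been measured here.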
