The following method is about 30 times faster than scipy.spatial.distance.pdist. It works pretty quickly on large matrices (assuming you have enough RAM).
See below for a discussion of how to optimize for sparsity.
import numpy
# base similarity matrix (all dot products)
# replace this with A.dot(A.T).toarray() for sparse representation
similarity = numpy.dot(A, A.T)
# squared magnitude of preference vectors (number of occurrences)
square_mag = numpy.diag(similarity)
# inverse squared magnitude (a zero magnitude divides by zero and yields inf)
with numpy.errstate(divide='ignore'):
    inv_square_mag = 1.0 / square_mag
# if it doesn't occur, set its inverse magnitude to zero (instead of inf)
inv_square_mag[numpy.isinf(inv_square_mag)] = 0
# inverse of the magnitude
inv_mag = numpy.sqrt(inv_square_mag)
# cosine similarity (elementwise multiply by inverse magnitudes)
# the first multiply scales the columns; transposing and multiplying again scales the other axis
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag
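As a rough sanity check, the steps above can be wrapped in a function and compared against pdist on a small matrix. This is only a sketch: the function name cosine_similarity and the sample data are made up for illustration, and the comparison assumes no all-zero rows, since pdist returns nan for those.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def cosine_similarity(A):
    """Pairwise cosine similarity between the rows of a dense array A."""
    similarity = np.dot(A, A.T)
    # astype(float) copies the diagonal, so it is safe to index and divide
    square_mag = np.diag(similarity).astype(float)
    inv_mag = np.zeros_like(square_mag)
    nonzero = square_mag > 0
    inv_mag[nonzero] = 1.0 / np.sqrt(square_mag[nonzero])
    # scale row i and column j by the inverse magnitudes of vectors i and j
    return similarity * inv_mag[:, None] * inv_mag[None, :]

# Hypothetical data: 3 items (rows) by 4 users (columns), no all-zero rows
A = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1]])

cosine = cosine_similarity(A)
# pdist returns cosine *distances* in condensed form; convert to similarities
reference = 1.0 - squareform(pdist(A, 'cosine'))
print(np.allclose(cosine, reference))
```

Using explicit broadcasting here replaces the transpose trick, but the result should be the same symmetric matrix.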
If your problem is typical of large-scale binary preference problems, one dimension has many more entries than the other, and the short dimension is the one whose entries you want to calculate similarities between. Let's call this dimension the 'item' dimension.
If this is the case, list your 'items' in rows and create A using scipy.sparse. Then replace the line that computes similarity as indicated in the comment above it.
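A minimal sketch of that typical case, assuming a made-up matrix of 3 items by 6 users stored as a scipy.sparse.csr_matrix (the data and variable names are hypothetical). Only the dot product runs on the sparse matrix; the small item-by-item result is converted to a dense array and normalized as before.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical data: 3 items (rows) by 6 users (columns), mostly zeros
rows = np.array([0, 0, 1, 1, 2])
cols = np.array([0, 3, 0, 5, 2])
data = np.ones(len(rows))
A = sp.csr_matrix((data, (rows, cols)), shape=(3, 6))

# sparse dot product, then a dense items-by-items array
similarity = A.dot(A.T).toarray()

# same normalization as in the dense version
square_mag = np.diag(similarity)
with np.errstate(divide='ignore'):
    inv_square_mag = 1.0 / square_mag
inv_square_mag[np.isinf(inv_square_mag)] = 0
inv_mag = np.sqrt(inv_square_mag)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag
print(cosine)
```

This only pays off when the item dimension is small enough for the dense items-by-items matrix to fit in memory.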
If your problem is atypical, you'll need more modifications. Those should be pretty straightforward replacements of basic numpy operations with their scipy.sparse equivalents.
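One possible shape for such a fully sparse variant is sketched below. It is an illustration under assumptions, not a tested drop-in: the function name sparse_cosine_similarity is made up, and it swaps the broadcasting step for multiplication by a sparse diagonal matrix built with scipy.sparse.diags, so the similarity matrix never has to be densified.

```python
import numpy as np
import scipy.sparse as sp

def sparse_cosine_similarity(A):
    """Pairwise cosine similarity between rows of A, returned as a sparse matrix."""
    A = sp.csr_matrix(A, dtype=float)
    # sparse replacement for numpy.dot(A, A.T)
    similarity = A.dot(A.T)
    # sparse replacement for numpy.diag(similarity); returns a 1-D ndarray
    square_mag = similarity.diagonal()
    inv_mag = np.zeros_like(square_mag)
    nonzero = square_mag > 0
    inv_mag[nonzero] = 1.0 / np.sqrt(square_mag[nonzero])
    # multiplying by a diagonal matrix on both sides scales rows and columns
    scale = sp.diags(inv_mag)
    return scale.dot(similarity).dot(scale)

# Hypothetical data, including an all-zero row
A = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 0]])
result = sparse_cosine_similarity(A)
print(result.toarray())
```

Whether this helps depends on how sparse the similarity matrix itself turns out to be; if most item pairs share at least one entry, the dense version is likely simpler and just as fast.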