Can someone give an example of cosine similarity in a very simple graphical way

Question

Cosine Similarity article on Wikipedia  Can you show the vectors here  in a list or something  and then do the math  and let us see how it works   I m a beginner

User · Answer

Simple JAVA code to calculate cosine similarity           Method to calculate cosine similarity of vectors      1 - exactly similar  angle between them is 0       0 - orthogonal vectors  angle between them is 90        param vector1 - vector in the form  a1  a2  a3        an        param vector2 - vector in the form  b1  b2  b3        bn        return - the cosine similarity of vectors  ranges from 0 to 1          private double cosineSimilarity List lt Double gt  vector1  List lt Double gt  vector2         double dotProduct   0 0      double normA   0 0      double normB   0 0      for  int i   0  i  lt  vector1 size    i            dotProduct    vector1 get i    vector2 get i         normA    Math pow vector1 get i   2         normB    Math pow vector2 get i   2             return dotProduct    Math sqrt normA    Math sqrt normB

User · Answer

Here s my implementation in C      using System   namespace CosineSimilarity       class Program               static void Main                         int   vecA    1  2  3  4  5               int   vecB    6  7  7  9  10                var cosSimilarity   CalculateCosineSimilarity vecA  vecB                Console WriteLine cosSimilarity               Console Read                       private static double CalculateCosineSimilarity int   vecA  int   vecB                        var dotProduct   DotProduct vecA  vecB               var magnitudeOfA   Magnitude vecA               var magnitudeOfB   Magnitude vecB                return dotProduct  magnitudeOfA magnitudeOfB                      private static double DotProduct int   vecA  int   vecB                           I m not validating inputs here for simplicity                          double dotProduct   0              for  var i   0  i  lt  vecA Length  i                                  dotProduct     vecA i    vecB i                               return dotProduct                        Magnitude of the vector is the square root of the dot product of the vector with itself          private static double Magnitude int   vector                        return Math Sqrt DotProduct vector  vector

User · Answer

I m guessing you are more interested in getting some insight into  why  the cosine similarity works  why it provides a good indication of similarity   rather than  how  it is calculated  the specific operations used for the calculation    If your interest is in the latter  see the reference indicated by Daniel in this post  as well as a related SO Question   To explain both the how and even more so the why  it is useful  at first  to simplify the problem and to work only in two dimensions  Once you get this in 2D  it is easier to think of it in three dimensions  and of course harder to imagine in many more dimensions  but by then we can use linear algebra to do the numeric calculations and also to help us think in terms of lines   vectors    planes     spheres  in n dimensions  even though we can t draw these   So  in two dimensions  with regards to text similarity this means that we would focus on two distinct terms  say the words  London  and  Paris   and we d count how many times each of these words is found in each of the two documents we wish to compare   This gives us  for each document  a point in the the x-y plane  For example  if Doc1 had Paris once  and London four times  a point at  1 4  would present this document  with regards to this diminutive evaluation of documents    Or  speaking in terms of vectors  this Doc1 document would be an arrow going from the origin to point  1 4    With this image in mind  let s think about what it means for two documents to be similar and how this relates to the vectors   VERY similar documents  again with regards to this limited set of dimensions  would have the very same number of references to Paris  AND the very same number of references to London  or maybe  they could have the same ratio of these references  A Document  Doc2  with 2 refs to Paris and 8 refs to London  would also be very similar  only with maybe a longer text or somehow more repetitive of the cities  names  but in the same proportion  Maybe both documents are guides about London  only making passing references to Paris  and how uncool that city is  -  Just kidding      Now  less similar documents may also include references to both cities  but in different proportions  Maybe Doc2 would only cite Paris once and London seven times   Back to our x-y plane  if we draw these hypothetical documents  we see that when they are VERY similar  their vectors overlap  though some vectors may be longer   and as they start to have less in common  these vectors start to diverge  to have a wider angle between them   By measuring the angle between the vectors  we can get a good idea of their similarity  and to make things even easier  by taking the Cosine of this angle  we have a nice 0 to 1 or -1 to 1 value that is indicative of this similarity  depending on what and how we account for   The smaller the angle  the bigger  closer to 1  the cosine value  and also the higher the similarity   At the extreme  if Doc1 only cites Paris and Doc2 only cites London  the documents have absolutely nothing in common   Doc1 would have its vector on the x-axis  Doc2 on the y-axis  the angle 90 degrees  Cosine 0  In this case we d say that these documents are orthogonal to one another   Adding dimensions  With this intuitive feel for similarity expressed as a small angle  or large cosine   we can now imagine things in 3 dimensions  say by bringing the word  Amsterdam  into the mix  and visualize quite well how a document with two references to each would have a vector going in a particular direction  and we can see how this direction would compare to a document citing Paris and London three times each  but not Amsterdam  etc  As said  we can try and imagine the this fancy space for 10 or 100 cities  It s hard to draw  but easy to conceptualize   I ll wrap up just by saying a few words about the formula itself  As I ve said  other references provide good information about the calculations   First in two dimensions  The formula for the Cosine of the angle between two vectors is derived from the trigonometric difference  between angle a and angle b    cos a - b     cos a    cos b      sin  a    sin b     This formula looks very similar to the dot product formula   Vect1   Vect2     x1   x2     y1   y2    where cos a  corresponds to the x value and sin a  the y value  for the first vector  etc   The only problem  is that x  y  etc  are not exactly the cos and sin values  for these values need to be read on the unit circle  That s where the denominator of the formula kicks in  by dividing by the product of the length of these vectors  the x and y coordinates become normalized

User · Answer

For simplicity I am reducing the vector a and b   Let       a    1  1  0      b    1  0  1    Then cosine similarity  Theta      Theta     1 1   1 0   0 1  sqrt  1 2   1 2    sqrt  1 2   1 2     1 2   0 5   then inverse of cos 0 5 is 60 degrees

User · Answer

import java util HashMap  import java util HashSet  import java util Map  import java util Set              author Xiao Ma   mail   409791952 qq com        public class SimilarityUtil    public static double consineTextSimilarity String   left  String   right        Map lt String  Integer gt  leftWordCountMap   new HashMap lt String  Integer gt         Map lt String  Integer gt  rightWordCountMap   new HashMap lt String  Integer gt         Set lt String gt  uniqueSet   new HashSet lt String gt         Integer temp   null      for  String leftWord   left            temp   leftWordCountMap get leftWord           if  temp    null                leftWordCountMap put leftWord  1               uniqueSet add leftWord             else               leftWordCountMap put leftWord  temp   1                       for  String rightWord   right            temp   rightWordCountMap get rightWord           if  temp    null                rightWordCountMap put rightWord  1               uniqueSet add rightWord             else               rightWordCountMap put rightWord  temp   1                       int   leftVector   new int uniqueSet size         int   rightVector   new int uniqueSet size         int index   0      Integer tempCount   0      for  String uniqueWord   uniqueSet            tempCount   leftWordCountMap get uniqueWord           leftVector index    tempCount    null   0   tempCount          tempCount   rightWordCountMap get uniqueWord           rightVector index    tempCount    null   0   tempCount          index              return consineVectorSimilarity leftVector  rightVector             The resulting similarity ranges from -1 meaning exactly opposite  to 1    meaning exactly the same  with 0 usually indicating independence  and    in-between values indicating intermediate similarity or dissimilarity         For text matching  the attribute vectors A and B are usually the term    frequency vectors of the documents  The cosine similarity can be seen as    a method of normalizing document length during comparison         In the case of information retrieval  the cosine similarity of two    documents will range from 0 to 1  since the term frequencies  tf-idf    weights  cannot be negative  The angle between two term frequency vectors    cannot be greater than 90            param leftVector     param rightVector     return     private static double consineVectorSimilarity int   leftVector          int   rightVector        if  leftVector length    rightVector length          return 1      double dotProduct   0      double leftNorm   0      double rightNorm   0      for  int i   0  i  lt  leftVector length  i              dotProduct    leftVector i    rightVector i           leftNorm    leftVector i    leftVector i           rightNorm    rightVector i    rightVector i              double result   dotProduct                Math sqrt leftNorm    Math sqrt rightNorm        return result     public static void main String   args        String left        Julie    loves    me    more    than    Linda                loves    me         String right        Jane    likes    me    more    than    Julie                loves    me         System out println consineTextSimilarity left right

User · Answer

This Python code is my quick and dirty attempt to implement the algorithm   import math from collections import Counter  def build vector iterable1  iterable2       counter1   Counter iterable1      counter2   Counter iterable2      all items   set counter1 keys    union set counter2 keys         vector1    counter1 k  for k in all items      vector2    counter2 k  for k in all items      return vector1  vector2  def cosim v1  v2       dot product   sum n1   n2 for n1  n2 in zip v1  v2        magnitude1   math sqrt sum n    2 for n in v1       magnitude2   math sqrt sum n    2 for n in v2       return dot product    magnitude1   magnitude2    l1    Julie loves me more than Linda loves me  split   l2    Jane likes me more than Julie loves me or  split     v1  v2   build vector l1  l2  print cosim v1  v2

User · Answer

Here s a simple Python code to calculate cosine similarity  import math  def dot prod v1  v2       ret   0     for i in range len v1            ret    v1 i    v2 i      return ret  def magnitude v       ret   0     for i in v          ret    i  2     return math sqrt ret   def cos sim v1  v2       return  dot prod v1  v2      magnitude v1    magnitude v2

User · Answer

This is a simple Python code which implements cosine similarity   from scipy import linalg  mat  dot import numpy as np  In  12   matrix   mat    2  1  0  2  0  1  1  1   2  1  1  1  1  0  1  1      In  13   matrix Out 13    matrix   2  1  0  2  0  1  1  1            2  1  1  1  1  0  1  1    In  14   dot matrix 0  matrix 1  T  np linalg norm matrix 0   np linalg norm matrix 1   Out 14   matrix    0 82158384

User · Answer

Using  Bill Bell example  two ways to do this in  R   a   c 2 1 0 2 0 1 1 1   b   c 2 1 1 1 1 0 1 1   d    a     b     sqrt sum a 2     sqrt sum b 2      or taking advantage of crossprod   method s performance     e   crossprod a  b     sqrt crossprod a  a     sqrt crossprod b  b

User · Answer

Two Vectors A and B exists in a 2D space or 3D space  the angle between those vectors is cos similarity    If the angle is more  can reach max 180 degree  which is Cos 180 -1 and the minimum angle is 0 degree  cos 0  1 implies the vectors are aligned to each other and hence the vectors are similar    cos 90 0  which is sufficient to conclude that the vectors A and B are not similar at all and since distance cant be negative  the cosine values will lie from 0 to 1  Hence  more angle implies implies reducing similarity  visualising also it makes sense

User · Answer

Here are two very short texts to compare    Julie loves me more than Linda loves me Jane likes me more than Julie loves me   We want to know how similar these texts are  purely in terms of word counts  and ignoring word order   We begin by making a list of the words from both texts   me Julie loves Linda than more likes Jane   Now we count the number of times each of these words appears in each text      me   2   2  Jane   0   1 Julie   1   1 Linda   1   0 likes   0   1 loves   2   1  more   1   1  than   1   1   We are not interested in the words themselves though  We are interested only in those two vertical vectors of counts  For instance  there are two instances of  me  in each text  We are going to decide how close these two texts are to each other by calculating one function of those two vectors  namely the cosine of the angle between them    The two vectors are  again   a   2  0  1  1  0  2  1  1   b   2  1  1  0  1  1  1  1    The cosine of the angle between them is about 0 822   These vectors are 8-dimensional  A virtue of using cosine similarity is clearly that it converts a question that is beyond human ability to visualise to one that can be  In this case you can think of this as the angle of about 35 degrees which is some  distance  from zero or perfect agreement

[text] Can someone give an example of cosine similarity, in a very simple, graphical way?

Examples related to text

Examples related to data-mining

Examples related to cosine-similarity