Similarity String Comparison in Java

Question

I want to compare several strings to each other  and find the ones that are the most similar   I was wondering if there is any library  method or best practice that would return me which strings are more similar to other strings   For example     The quick fox jumped  -   The fox jumped    The quick fox jumped  -    The fox    This comparison would return that the first is more similar than the second   I guess I need some method such as   double similarityIndex String s1  String s2    Is there such a thing somewhere   EDIT  Why am I doing this  I am writing a script that compares the output of a MS Project file to the output of some legacy system that handles tasks   Because the legacy system has a very limited field width  when the values are added the descriptions are abbreviated   I want some semi-automated way to find which entries from MS Project are similar to the entries on the system so I can get the generated keys   It has drawbacks  as it has to be still manually checked  but it would save a lot of work

User · Answer

There are indeed a lot of string similarity measures out there:

Levenshtein edit distance;
Damerau-Levenshtein distance;
Jaro-Winkler similarity;
Longest Common Subsequence edit distance;
Q-Gram (Ukkonen);
n-Gram distance (Kondrak);
Jaccard index;
Sorensen-Dice coefficient;
Cosine similarity;
...

You can find explanation and java implementation of these here: https://github.com/tdebatty/java-string-similarity

User · Answer

Sounds like a plagiarism finder to me if your string turns into a document   Maybe searching with that term will turn up something good    Programming Collective Intelligence  has a chapter on determining whether two documents are similar   The code is in Python  but it s clean and easy to port

User · Answer

The common way of calculating the similarity between two strings in a 0 -100  fashion  as used in many libraries  is to measure how much  in    you d have to change the longer string to turn it into the shorter          Calculates the similarity  a number within 0 and 1  between two strings      public static double similarity String s1  String s2      String longer   s1  shorter   s2    if  s1 length    lt  s2 length         longer should always have greater length     longer   s2  shorter   s1        int longerLength   longer length      if  longerLength    0    return 1 0     both strings are zero length        return  longerLength - editDistance longer  shorter      double  longerLength       you can use StringUtils getLevenshteinDistance   as the editDistance   function    full copy-paste working code is below     Computing the editDistance     The editDistance   function above is expected to calculate the edit distance between the two strings  There are several implementations to this step  each may suit a specific scenario better  The most common is the Levenshtein distance algorithm and we ll use it in our example below  for very large strings  other algorithms are likely to perform better    Here s two options to calculate the edit distance    You can use Apache Commons Text s implementation of Levenshtein distance                     apply CharSequence left  CharSequence rightt  Implement it in your own  Below you ll find an example implementation      Working example   See online demo here   public class StringSimilarity               Calculates the similarity  a number within 0 and 1  between two strings          public static double similarity String s1  String s2        String longer   s1  shorter   s2      if  s1 length    lt  s2 length         longer should always have greater length       longer   s2  shorter   s1            int longerLength   longer length        if  longerLength    0    return 1 0     both strings are zero length                If you have Apache Commons Text  you can use it to calculate the edit distance      LevenshteinDistance levenshteinDistance   new LevenshteinDistance        return  longerLength - levenshteinDistance apply longer  shorter      double  longerLength         return  longerLength - editDistance longer  shorter      double  longerLength             Example implementation of the Levenshtein Edit Distance      See http   rosettacode org wiki Levenshtein distance Java   public static int editDistance String s1  String s2        s1   s1 toLowerCase        s2   s2 toLowerCase         int   costs   new int s2 length     1       for  int i   0  i  lt   s1 length    i            int lastValue   i        for  int j   0  j  lt   s2 length    j              if  i    0            costs j    j          else             if  j  gt  0                int newValue   costs j - 1               if  s1 charAt i - 1     s2 charAt j - 1                 newValue   Math min Math min newValue  lastValue                     costs j     1              costs j - 1    lastValue              lastValue   newValue                                      if  i  gt  0          costs s2 length      lastValue            return costs s2 length            public static void printSimilarity String s  String t        System out println String format           3f is the similarity between    s   and    s     similarity s  t   s  t           public static void main String   args        printSimilarity              printSimilarity  1234567890    1        printSimilarity  1234567890    123        printSimilarity  1234567890    1234567        printSimilarity  1234567890    1234567890        printSimilarity  1234567890    1234567980        printSimilarity  47 2010    472010        printSimilarity  47 2010    472011        printSimilarity  47 2010    AB CDEF        printSimilarity  47 2010    4B CDEFG        printSimilarity  47 2010    AB CDEFG        printSimilarity  The quick fox jumped    The fox jumped        printSimilarity  The quick fox jumped    The fox        printSimilarity  kitten    sitting             Output   1 000 is the similarity between    and    0 100 is the similarity between  1234567890  and  1  0 300 is the similarity between  1234567890  and  123  0 700 is the similarity between  1234567890  and  1234567  1 000 is the similarity between  1234567890  and  1234567890  0 800 is the similarity between  1234567890  and  1234567980  0 857 is the similarity between  47 2010  and  472010  0 714 is the similarity between  47 2010  and  472011  0 000 is the similarity between  47 2010  and  AB CDEF  0 125 is the similarity between  47 2010  and  4B CDEFG  0 000 is the similarity between  47 2010  and  AB CDEFG  0 700 is the similarity between  The quick fox jumped  and  The fox jumped  0 350 is the similarity between  The quick fox jumped  and  The fox  0 571 is the similarity between  kitten  and  sitting

User · Answer

Yes  there are many well documented algorithms like    Cosine similarity Jaccard similarity Dice s coefficient Matching similarity Overlap similarity etc etc   A good summary   Sam s String Metrics   can be found here  original link dead  so it links to Internet Archive   Also check these projects    Simmetrics jtmt

User · Answer

Theoretically  you can compare edit distances

User · Answer

This is typically done using an edit distance measure  Searching for  edit distance java  turns up a number of libraries  like this one

User · Answer

You could use Levenshtein distance to calculate the difference between two strings  http   en wikipedia org wiki Levenshtein distance

User · Answer

You can achieve this using the apache commons java library  Take a look at these two functions within it   - getLevenshteinDistance  - getFuzzyDistance

User · Answer

I translated the Levenshtein distance algorithm into JavaScript   String prototype LevenshteinDistance   function  s2        var array   new Array this length   1       for  var i   0  i  lt  this length   1  i            array i    new Array s2 length   1        for  var i   0  i  lt  this length   1  i            array i  0    i      for  var j   0  j  lt  s2 length   1  j            array 0  j    j       for  var i   1  i  lt  this length   1  i              for  var j   1  j  lt  s2 length   1  j                  if  this i - 1     s2 j - 1   array i  j    array i - 1  j - 1               else                   array i  j    Math min array i  j - 1    1  array i - 1  j    1                   array i  j    Math min array i  j   array i - 1  j - 1    1                                     return array this length  s2 length

User · Answer

You can also use z algorithm to find similarity in the string  Click here https   teakrunch com 2020 05 09 string-similarity-hackerrank-challenge

User · Answer

Thank to the first answerer  I think there are 2 calculations of computeEditDistance s1  s2   Due to high time spending of it   decided to improve the code s performance  So   public class LevenshteinDistance    public static int computeEditDistance String s1  String s2        s1   s1 toLowerCase        s2   s2 toLowerCase         int   costs   new int s2 length     1       for  int i   0  i  lt   s1 length    i              int lastValue   i          for  int j   0  j  lt   s2 length    j                  if  i    0                    costs j    j                else                   if  j  gt  0                        int newValue   costs j - 1                       if  s1 charAt i - 1     s2 charAt j - 1                             newValue   Math min Math min newValue  lastValue                                   costs j     1                                            costs j - 1    lastValue                      lastValue   newValue                                                    if  i  gt  0                costs s2 length      lastValue                      return costs s2 length        public static void printDistance String s1  String s2        double similarityOfStrings   0 0      int editDistance   0      if  s1 length    lt  s2 length         s1 should always be bigger         String swap   s1          s1   s2          s2   swap            int bigLen   s1 length        editDistance   computeEditDistance s1  s2       if  bigLen    0            similarityOfStrings   1 0     both strings are zero length          else           similarityOfStrings    bigLen - editDistance     double  bigLen                                             System out println s1    -- gt     s2                        editDistance          similarityOfStrings             System out println editDistance          similarityOfStrings            public static void main String   args        printDistance              printDistance  1234567890    1        printDistance  1234567890    12        printDistance  1234567890    123        printDistance  1234567890    1234        printDistance  1234567890    12345        printDistance  1234567890    123456        printDistance  1234567890    1234567        printDistance  1234567890    12345678        printDistance  1234567890    123456789        printDistance  1234567890    1234567890        printDistance  1234567890    1234567980         printDistance  47 2010    472010        printDistance  47 2010    472011         printDistance  47 2010    AB CDEF        printDistance  47 2010    4B CDEFG        printDistance  47 2010    AB CDEFG         printDistance  The quick fox jumped    The fox jumped        printDistance  The quick fox jumped    The fox        printDistance  The quick fox jumped                The quick fox jumped off the balcany        printDistance  kitten    sitting        printDistance  rosettacode    raisethysword        printDistance new StringBuilder  rosettacode   reverse   toString                new StringBuilder  raisethysword   reverse   toString         for  int i   1  i  lt  args length  i    2            printDistance args i - 1   args i

[java] Similarity String Comparison in Java

Examples related to java

Examples related to string-comparison