Find the similarity metric between two strings

Question

How do I get the probability of a string being similar to another string in Python? I want a decimal value such as 0.9 (meaning 90%), preferably using standard Python and its library.

For example, similar("Apple", "Appel") would have a high probability, and similar("Apple", "Mango") would have a lower probability.

User · Answer

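Here's what I thought of:

    import string

    def match(a, b):
        a, b = a.lower(), b.lower()
        error = 0
        # Sum the differences in how often each letter occurs in the two strings.
        for i in string.ascii_lowercase:
            error += abs(a.count(i) - b.count(i))
        total = len(a) + len(b)
        # Note: total is 0 when both strings are empty, which raises ZeroDivisionError.
        return (total - error) / total

    if __name__ == "__main__":
        print(match("pple inc", "Apple Inc."))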

User · Answer

The distance package includes Levenshtein distance:

    import distance

    distance.levenshtein("lenvestein", "levenshtein")
    # 3
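
If installing a package is not an option, Levenshtein distance can also be computed with a short dynamic-programming routine. The following is a minimal sketch, not the implementation used by the distance package; the function names `levenshtein` and `levenshtein_similarity` are made up for this example, and dividing by the longer length is just one common way to scale the distance to a 0..1 score.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical (an assumption)
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("lenvestein", "levenshtein"))
print(levenshtein_similarity("Apple", "Appel"))
```

Swapping two adjacent letters counts as two edits here, so "Apple" and "Appel" score lower than they would under Damerau-Levenshtein.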

User · Answer

Solution #1: Python built-in

Use SequenceMatcher from difflib.

Pros: native Python library, no extra package needed.
Cons: too limited; there are many other good algorithms for string similarity out there.

Example:

    >>> from difflib import SequenceMatcher
    >>> s = SequenceMatcher(None, "abcd", "bcde")
    >>> s.ratio()
    0.75

Solution #2: jellyfish library

It's a very good library with good coverage and few issues. It supports:

- Levenshtein Distance
- Damerau-Levenshtein Distance
- Jaro Distance
- Jaro-Winkler Distance
- Match Rating Approach Comparison
- Hamming Distance

Pros: easy to use, gamut of supported algorithms, tested.
Cons: not a native library.

Example:

    >>> import jellyfish
    >>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
    2
    >>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
    0.89629629629629637
    >>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
    1

User · Answer

TextDistance is a Python library for comparing the distance between two or more sequences by many algorithms. It has:

- 30+ algorithms
- Pure Python implementation
- Simple usage
- Comparison of more than two sequences
- More than one implementation in one class for some algorithms
- Optional numpy usage for maximum speed

Example 1:

    import textdistance
    textdistance.hamming('test', 'text')

Output:

    1

Example 2:

    import textdistance
    textdistance.hamming.normalized_similarity('test', 'text')

Output:

    0.75

User · Answer

There is a built-in:

    from difflib import SequenceMatcher

    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()

Using it:

    >>> similar("Apple", "Appel")
    0.8
    >>> similar("Apple", "Mango")
    0.0
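
The same standard-library module can also rank several candidates against one target. The sketch below is only an illustration: the candidate list is made up, and lowercasing the inputs is an added assumption (SequenceMatcher itself is case-sensitive). `difflib.get_close_matches` is the built-in shortcut for "give me the best few matches above a cutoff".

```python
from difflib import SequenceMatcher, get_close_matches


def similar(a: str, b: str) -> float:
    # Lowercase both sides so that "Apple" and "apple" count as identical.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


candidates = ["Mango", "Appel", "apple"]  # hypothetical data for the example

# Sort the candidates from most to least similar to "Apple".
ranked = sorted(candidates, key=lambda c: similar("Apple", c), reverse=True)
print(ranked)

# Built-in helper: at most n matches whose ratio is at least cutoff (case-sensitive).
close = get_close_matches("Apple", candidates, n=2, cutoff=0.6)
print(close)
```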

User · Answer

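You can create a function like:

    def similar(w1, w2):
        # Pad the shorter word with spaces so both have the same length.
        w1 = w1 + ' ' * (len(w2) - len(w1))
        w2 = w2 + ' ' * (len(w1) - len(w2))
        # Fraction of positions at which the two words have the same character.
        return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))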

User · Answer

Note  difflib SequenceMatcher only finds the longest contiguous matching subsequence  this is often not what is desired  for example    gt  gt  gt  a1    Apple   gt  gt  gt  a2    Appel   gt  gt  gt  a1    50  gt  gt  gt  a2    50  gt  gt  gt  SequenceMatcher None  a1  a2  ratio   0 012    very low  gt  gt  gt  SequenceMatcher None  a1  a2  get matching blocks    Match a 0  b 0  size 3   Match a 250  b 250  size 0      only the first block is recorded   Finding the similarity between two strings is closely related to the concept of pairwise sequence alignment in bioinformatics  There are many dedicated libraries for this including biopython  This example implements the Needleman Wunsch algorithm    gt  gt  gt  from Bio Align import PairwiseAligner  gt  gt  gt  aligner   PairwiseAligner    gt  gt  gt  aligner score a1  a2  200 0  gt  gt  gt  aligner algorithm  Needleman-Wunsch    Using biopython or another bioinformatics package is more flexible than any part of the python standard library since many different scoring schemes and algorithms are available  Also  you can actually get the matching sequences to visualise what is happening    gt  gt  gt  alignment   next aligner align a1  a2    gt  gt  gt  alignment score 200 0  gt  gt  gt  print alignment  Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-    - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - -   - - 
App-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-el
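
The very low ratio in the SequenceMatcher example appears to come from difflib's automatic junk heuristic, which is on by default and applies once a sequence is at least 200 items long. As a hedged sketch staying within the standard library, passing `autojunk=False` (a documented SequenceMatcher parameter) should let it match the repeated blocks again; the variable names below are only for this example.

```python
from difflib import SequenceMatcher

a1 = "Apple" * 50
a2 = "Appel" * 50

# Default behaviour: very frequent characters are treated as junk on long input.
default_ratio = SequenceMatcher(None, a1, a2).ratio()

# Same comparison with the automatic junk heuristic switched off.
no_autojunk_ratio = SequenceMatcher(None, a1, a2, autojunk=False).ratio()

print(default_ratio)
print(no_autojunk_ratio)
```

Turning the heuristic off can make SequenceMatcher noticeably slower on large inputs, so this is a trade-off rather than a free fix.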

User · Answer

FuzzyWuzzy is a package that implements Levenshtein distance in Python, with some helper functions for situations where you may want two distinct strings to be considered identical. For example:

    >>> from fuzzywuzzy import fuzz
    >>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
        91
    >>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
        100
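
The idea behind a token-sort comparison can be approximated with the standard library alone: sort the words of each string before comparing, so that word order no longer matters. This is a rough sketch of the idea, not FuzzyWuzzy's actual implementation; the function name `token_sort_ratio` is reused here only for readability, and the score is returned on a 0..1 scale rather than 0..100.

```python
from difflib import SequenceMatcher


def token_sort_ratio(a: str, b: str) -> float:
    # Lowercase, split on whitespace, and sort the words of each string.
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    # Compare the normalised strings with the built-in matcher.
    return SequenceMatcher(None, sorted_a, sorted_b).ratio()


s1 = "fuzzy wuzzy was a bear"
s2 = "wuzzy fuzzy was a bear"

plain = SequenceMatcher(None, s1, s2).ratio()  # sensitive to word order
sorted_score = token_sort_ratio(s1, s2)        # insensitive to word order
print(plain, sorted_score)
```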

User · Answer

You can find most of the text similarity methods and how they are calculated under this link: https://github.com/luozhouyang/python-string-similarity#python-string-similarity

Here are some examples:

- Normalized, metric, similarity and distance
  - (Normalized) similarity and distance
  - Metric distances
- Shingles (n-gram) based similarity and distance
- Levenshtein
- Normalized Levenshtein
- Weighted Levenshtein
- Damerau-Levenshtein
- Optimal String Alignment
- Jaro-Winkler
- Longest Common Subsequence
- Metric Longest Common Subsequence
- N-Gram
- Shingle (n-gram) based algorithms
  - Q-Gram
  - Cosine similarity
  - Jaccard index
  - Sorensen-Dice coefficient
  - Overlap coefficient (i.e., Szymkiewicz-Simpson)

User · Answer

I think maybe you are looking for an algorithm describing the distance between strings. Here are some you may refer to:

- Hamming distance
- Levenshtein distance
- Damerau-Levenshtein distance
- Jaro-Winkler distance
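
Hamming distance is the simplest of these: it counts the positions at which two equal-length strings differ. A minimal sketch follows; the function names are made up for this example, and rejecting strings of different lengths is a design choice here (libraries may handle that case differently).

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def hamming_similarity(a: str, b: str) -> float:
    """Share of matching positions, between 0.0 and 1.0."""
    if not a:
        return 1.0  # two empty strings are treated as identical (an assumption)
    return 1.0 - hamming(a, b) / len(a)


print(hamming("test", "text"))
print(hamming_similarity("test", "text"))
```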

User · Answer

BLEU score

BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate translation of text to one or more reference translations. A perfect match results in a score of 1.0, whereas a perfect mismatch results in a score of 0.0. Although developed for translation, it can be used to evaluate text generated for a suite of natural language processing tasks.

Code:

    import nltk
    from nltk.translate import bleu
    from nltk.translate.bleu_score import SmoothingFunction

    smoothie = SmoothingFunction().method4

    C1 = 'Text'
    C2 = 'Best'

    # Plain strings are iterated character by character, so each letter acts as a token.
    print('BLEUscore:', bleu([C1], C2, smoothing_function=smoothie))

Examples, obtained by updating C1 and C2:

    C1='Test' C2='Test'
    BLEUscore: 1.0

    C1='Test' C2='Best'
    BLEUscore: 0.2326589746035907

    C1='Test' C2='Text'
    BLEUscore: 0.2866227639866161

You can also compare sentence similarity:

    C1='It is tough.' C2='It is rough.'
    BLEUscore: 0.7348889200874658

    C1='It is tough.' C2='It is tough.'
    BLEUscore: 1.0

User · Answer

The built-in SequenceMatcher is very slow on large input. Here's how it can be done with diff-match-patch:

    from diff_match_patch import diff_match_patch

    def compute_similarity_and_diff(text1, text2):
        dmp = diff_match_patch()
        dmp.Diff_Timeout = 0.0  # 0 disables the time limit
        diff = dmp.diff_main(text1, text2, False)

        # similarity: share of the longer text covered by unchanged (op == 0) chunks
        common_text = sum([len(txt) for op, txt in diff if op == 0])
        text_length = max(len(text1), len(text2))
        sim = common_text / text_length

        return sim, diff

User · Answer

There are many metrics to define similarity and distance between strings, as mentioned above. I will give my 5 cents by showing an example of Jaccard similarity with Q-Grams and an example with edit distance.

The libraries:

    from nltk.metrics.distance import jaccard_distance
    from nltk.util import ngrams
    from nltk.metrics.distance import edit_distance

Jaccard Similarity

    1 - jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Appel', 2)))

and we get:

    0.33333333333333337

And for Apple and Mango:

    1 - jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Mango', 2)))

and we get:

    0.0

Edit Distance

    edit_distance('Apple', 'Appel')

and we get:

    2

And finally:

    edit_distance('Apple', 'Mango')

and we get:

    5

Cosine Similarity on Q-Grams (q=2)

Another solution is to work with the textdistance library. I will provide an example of Cosine Similarity:

    import textdistance
    1 - textdistance.Cosine(qval=2).distance('Apple', 'Appel')

and we get:

    0.5
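
For readers who want to see what these two Q-Gram scores compute, here is a small standard-library sketch of Jaccard and cosine similarity on character bigrams. It is an illustration of the formulas, not the code used by nltk or textdistance, and the helper names `bigrams`, `jaccard_similarity` and `cosine_similarity` are made up for this example.

```python
from collections import Counter
from math import sqrt


def bigrams(s: str) -> list:
    """All overlapping two-character substrings of s."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def jaccard_similarity(a: str, b: str) -> float:
    """Shared distinct bigrams divided by all distinct bigrams."""
    x, y = set(bigrams(a)), set(bigrams(b))
    if not x and not y:
        return 1.0  # nothing to compare; treated as identical (an assumption)
    return len(x & y) / len(x | y)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two bigram count vectors."""
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    dot = sum(x[g] * y[g] for g in x)
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0


print(jaccard_similarity("Apple", "Appel"))
print(cosine_similarity("Apple", "Appel"))
```

"Apple" and "Appel" share the bigrams "Ap" and "pp" out of six distinct ones, which is where the Jaccard value of about 0.33 comes from.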
