[python] Python - difference between two strings

I'd like to store a lot of words in a list. Many of these words are very similar. For example, I have the word afrykanerskojezyczny and many words like afrykanerskojezycznym, afrykanerskojezyczni, nieafrykanerskojezyczni. What is an efficient solution (fast, and producing a small diff) for finding the difference between two strings and restoring the second string from the first one plus the diff?

This question is related to python string python-3.x diff



What you are asking for is a specialized form of compression. xdelta3 was designed for this particular kind of compression, and there's a Python binding for it, but you could probably get away with using zlib directly. You'd want to use zlib.compressobj and zlib.decompressobj with the zdict parameter set to your "base word", e.g. afrykanerskojezyczny.

The caveats are that zdict is only supported in Python 3.3 and higher, and that it's easiest to code if you use the same "base word" for all your diffs, which may or may not be what you want.
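
A minimal sketch of the zdict idea, assuming a single shared base word and UTF-8 encoding. The helper names compress_against_base and restore_from_base are made up for illustration; only zlib.compressobj, zlib.decompressobj and their zdict parameter come from the standard library:

```python
import zlib

# Assumption: every word is stored relative to this one shared base word.
BASE = 'afrykanerskojezyczny'.encode('utf-8')


def compress_against_base(word, base=BASE):
    """Compress `word` using `base` as a preset zlib dictionary."""
    comp = zlib.compressobj(level=9, zdict=base)
    return comp.compress(word.encode('utf-8')) + comp.flush()


def restore_from_base(blob, base=BASE):
    """Rebuild the word; the same `base` must be supplied again."""
    decomp = zlib.decompressobj(zdict=base)
    return (decomp.decompress(blob) + decomp.flush()).decode('utf-8')


words = ['afrykanerskojezycznym', 'afrykanerskojezyczni',
         'nieafrykanerskojezyczni']
blobs = [compress_against_base(w) for w in words]
restored = [restore_from_base(b) for b in blobs]
print(restored == words)
```

The zlib container adds a few bytes of header and checksum to every blob, so for words this short it is worth measuring the blob sizes before relying on this approach.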


I like the ndiff answer, but if you want to collect only the changes into a list, you could do something like:

import difflib

case_a = 'afrykbnerskojezyczny'
case_b = 'afrykanerskojezycznym'

output_list = [li for li in difflib.ndiff(case_a, case_b) if li[0] != ' ']
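
That filtered list drops the unchanged characters, so on its own it no longer records where each change belongs. If the goal is to rebuild the second string from the first one, one possible sketch is to keep positions alongside the changes using difflib.SequenceMatcher.get_opcodes. The helper names make_delta and apply_delta are hypothetical, not part of difflib:

```python
import difflib


def make_delta(a, b):
    """Return only the non-matching spans as (start, end, replacement)."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(i1, i2, b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != 'equal']


def apply_delta(a, delta):
    """Rebuild the second string from the first string and the delta."""
    pieces = []
    pos = 0
    for start, end, replacement in delta:
        pieces.append(a[pos:start])   # unchanged text before this change
        pieces.append(replacement)    # text that replaces a[start:end]
        pos = end
    pieces.append(a[pos:])            # unchanged tail
    return ''.join(pieces)


case_a = 'afrykbnerskojezyczny'
case_b = 'afrykanerskojezycznym'
delta = make_delta(case_a, case_b)
print(delta)
print(apply_delta(case_a, delta) == case_b)
```

Each delta entry says "replace a[start:end] with this text", which covers insertions (start == end) and deletions (empty replacement) as well as substitutions.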

You can look into the regex module (the section on fuzzy matching). I don't know whether you can get the actual differences, but you can at least limit the number of allowed changes (insertions, deletions, and substitutions):

import regex
sequence = 'afrykanerskojezyczny'
queries = [ 'afrykanerskojezycznym', 'afrykanerskojezyczni', 
            'nieafrykanerskojezyczni' ]
for q in queries:
    m = regex.search(r'(%s){e<=2}'%q, sequence)
    print('match' if m else 'nomatch')

You can use ndiff in the difflib module to do this. It has all the information necessary to convert one string into another string.

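A simple example (note that the reported position is the index of the entry within the ndiff output, not an offset into either string):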

import difflib

cases=[('afrykanerskojezyczny', 'afrykanerskojezycznym'),
       ('afrykanerskojezyczni', 'nieafrykanerskojezyczni'),
       ('afrykanerskojezycznym', 'afrykanerskojezyczny'),
       ('nieafrykanerskojezyczni', 'afrykanerskojezyczni'),
       ('nieafrynerskojezyczni', 'afrykanerskojzyczni'),
       ('abcdefg','xac')] 

for a,b in cases:     
    print('{} => {}'.format(a,b))  
    for i,s in enumerate(difflib.ndiff(a, b)):
        if s[0]==' ': continue
        elif s[0]=='-':
            print(u'Delete "{}" from position {}'.format(s[-1],i))
        elif s[0]=='+':
            print(u'Add "{}" to position {}'.format(s[-1],i))    
    print()      

prints:

afrykanerskojezyczny => afrykanerskojezycznym
Add "m" to position 20

afrykanerskojezyczni => nieafrykanerskojezyczni
Add "n" to position 0
Add "i" to position 1
Add "e" to position 2

afrykanerskojezycznym => afrykanerskojezyczny
Delete "m" from position 20

nieafrykanerskojezyczni => afrykanerskojezyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2

nieafrynerskojezyczni => afrykanerskojzyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2
Add "k" to position 7
Add "a" to position 8
Delete "e" from position 16

abcdefg => xac
Add "x" to position 0
Delete "b" from position 2
Delete "d" from position 4
Delete "e" from position 5
Delete "f" from position 6
Delete "g" from position 7
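
To illustrate the point that the ndiff output carries enough information to rebuild either string, here is a short sketch using difflib.restore from the standard library. It assumes the complete ndiff output is kept, including the unchanged entries:

```python
import difflib

a = 'afrykanerskojezyczni'
b = 'nieafrykanerskojezyczni'

# Keep the complete delta; restore() needs the unchanged entries too.
delta = list(difflib.ndiff(a, b))

restored_a = ''.join(difflib.restore(delta, 1))  # 1 selects the first sequence
restored_b = ''.join(difflib.restore(delta, 2))  # 2 selects the second sequence

print(restored_a == a, restored_b == b)
```

Every character becomes its own entry in the full delta, so the delta is longer than either word. This is convenient for inspecting differences, but not a compact storage format.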

Based on the reply to my comment on the original question, I think this is all the asker wants:

word = 'afrykanerskojezyczny'
wordlist = ['afrykanerskojezycznym','afrykanerskojezyczni','nieafrykanerskojezyczni']
for i in range(len(wordlist)):
    wordlist[i] = word

This will do the following:

For every entry in wordlist, it replaces that entry with the original word.

All you have to do is put this piece of code where you need to change wordlist, making sure that wordlist holds the words you need to change and that the original word is correct.

Hope this helps!

