I'd like to store a lot of words in a list. Many of these words are very similar. For example, I have the word afrykanerskojezyczny and many words like afrykanerskojezycznym, afrykanerskojezyczni, and nieafrykanerskojezyczni. What is an efficient way (fast, and producing a small diff) to find the difference between two strings and to restore the second string from the first one plus the diff?
This question is related to: python, string, python-3.x, diff
What you are asking for is a specialized form of compression. xdelta3 was designed for this particular kind of compression, and there's a Python binding for it, but you could probably get away with using zlib directly. You'd want to use zlib.compressobj and zlib.decompressobj with the zdict parameter set to your "base word", e.g. afrykanerskojezyczny.
Caveats: zdict is only supported in Python 3.3 and higher, and it's easiest to code if you have the same "base word" for all your diffs, which may or may not be what you want.
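A minimal sketch of the zdict idea, assuming a single shared base word and UTF-8 encoding. The helper names make_delta and apply_delta are made up for illustration and are not part of zlib. For words this short, the zlib stream header and checksum may outweigh the savings, so it is worth measuring on real data.

```python
import zlib

BASE = 'afrykanerskojezyczny'  # shared "base word" used as the preset dictionary


def make_delta(base, word):
    """Compress `word` using `base` as the zlib preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=base.encode('utf-8'))
    return comp.compress(word.encode('utf-8')) + comp.flush()


def apply_delta(base, delta):
    """Rebuild the word from `base` and the compressed delta."""
    decomp = zlib.decompressobj(zdict=base.encode('utf-8'))
    return (decomp.decompress(delta) + decomp.flush()).decode('utf-8')


words = ['afrykanerskojezycznym', 'afrykanerskojezyczni',
         'nieafrykanerskojezyczni']
deltas = [make_delta(BASE, w) for w in words]
restored = [apply_delta(BASE, d) for d in deltas]
```

The same base word has to be supplied on both sides; decompressing with a different dictionary fails with a zlib.error.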
I like the ndiff answer, but if you want to collect only the changes in a list, you could do something like:
import difflib
case_a = 'afrykbnerskojezyczny'
case_b = 'afrykanerskojezycznym'
output_list = [li for li in difflib.ndiff(case_a, case_b) if li[0] != ' ']
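Note that once the unchanged entries are filtered out, the list no longer records where each change applies, so it cannot rebuild either string by itself. If restoring is the goal, one option is to keep the full ndiff output and pass it to difflib.restore. This is only a sketch: the full delta is longer than the strings themselves, so it is not a compact format.

```python
import difflib

case_a = 'afrykbnerskojezyczny'
case_b = 'afrykanerskojezycznym'

# Keep the complete delta, including the unchanged ("  ") entries.
delta = list(difflib.ndiff(case_a, case_b))

# restore(delta, 1) yields the items of the first sequence,
# restore(delta, 2) the items of the second one.
restored_a = ''.join(difflib.restore(delta, 1))
restored_b = ''.join(difflib.restore(delta, 2))
```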
You can look into the regex module (the fuzzy section). I don't know if you can get the actual differences, but at least you can specify allowed number of different types of changes like insert, delete, and substitutions:
import regex

sequence = 'afrykanerskojezyczny'
queries = ['afrykanerskojezycznym', 'afrykanerskojezyczni',
           'nieafrykanerskojezyczni']
for q in queries:
    # {e<=2} allows at most two errors (insertions, deletions, substitutions)
    m = regex.search(r'(%s){e<=2}' % q, sequence)
    print('match' if m else 'nomatch')
You can use ndiff in the difflib module to do this. It has all the information necessary to convert one string into another string.
A simple example:
import difflib

cases = [('afrykanerskojezyczny', 'afrykanerskojezycznym'),
         ('afrykanerskojezyczni', 'nieafrykanerskojezyczni'),
         ('afrykanerskojezycznym', 'afrykanerskojezyczny'),
         ('nieafrykanerskojezyczni', 'afrykanerskojezyczni'),
         ('nieafrynerskojezyczni', 'afrykanerskojzyczni'),
         ('abcdefg', 'xac')]

for a, b in cases:
    print('{} => {}'.format(a, b))
    for i, s in enumerate(difflib.ndiff(a, b)):
        if s[0] == ' ':
            continue
        elif s[0] == '-':
            print(u'Delete "{}" from position {}'.format(s[-1], i))
        elif s[0] == '+':
            print(u'Add "{}" to position {}'.format(s[-1], i))
    print()
prints:
afrykanerskojezyczny => afrykanerskojezycznym
Add "m" to position 20
afrykanerskojezyczni => nieafrykanerskojezyczni
Add "n" to position 0
Add "i" to position 1
Add "e" to position 2
afrykanerskojezycznym => afrykanerskojezyczny
Delete "m" from position 20
nieafrykanerskojezyczni => afrykanerskojezyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2
nieafrynerskojezyczni => afrykanerskojzyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2
Add "k" to position 7
Add "a" to position 8
Delete "e" from position 16
abcdefg => xac
Add "x" to position 0
Delete "b" from position 2
Delete "d" from position 4
Delete "e" from position 5
Delete "f" from position 6
Delete "g" from position 7
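For a smaller diff that can still rebuild the second string, one possible approach is to store only the non-equal opcodes from difflib.SequenceMatcher. The sketch below is an assumption about what a compact format could look like, not something from the answers above; make_delta and apply_delta are hypothetical helper names.

```python
from difflib import SequenceMatcher


def make_delta(a, b):
    """Return (start, end, replacement) edits that turn `a` into `b`."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(i1, i2, b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != 'equal']


def apply_delta(a, delta):
    """Rebuild the second string from `a` and the edits from make_delta."""
    pieces = []
    pos = 0
    for start, end, replacement in delta:
        pieces.append(a[pos:start])   # unchanged text before this edit
        pieces.append(replacement)    # text that replaces a[start:end]
        pos = end
    pieces.append(a[pos:])            # unchanged tail
    return ''.join(pieces)


cases = [('afrykanerskojezyczny', 'afrykanerskojezycznym'),
         ('afrykanerskojezyczni', 'nieafrykanerskojezyczni'),
         ('nieafrynerskojezyczni', 'afrykanerskojzyczni'),
         ('abcdefg', 'xac')]
results = [apply_delta(a, make_delta(a, b)) for a, b in cases]
```

Each edit refers to positions in the first string, so the edits can be applied in a single left-to-right pass without adjusting offsets.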
The reply to my comment on the original question makes me think this is all the asker wants:
loopnum = 0
word = 'afrykanerskojezyczny'
wordlist = ['afrykanerskojezycznym', 'afrykanerskojezyczni', 'nieafrykanerskojezyczni']
for i in wordlist:
    wordlist[loopnum] = word
    loopnum += 1
This will do the following: for every value in wordlist, it sets that value to the original word.
All you have to do is put this piece of code where you need to change wordlist, making sure that wordlist holds the words you need to change and that the original word is correct.
Hope this helps!
Source: Stackoverflow.com