I'm interested in tries and DAWGs (direct acyclic word graph) and I've been reading a lot about them but I don't understand what should the output trie or DAWG file look like.
-
or space?I want to understand the best output structure in order to figure out how to create and use one.
I would also appreciate what should be the output of a DAWG along with trie.
I do not want to see graphical representations with bubbles linked to each other, I want to know the output object once a set of words are turned into tries or DAWGs.
There's no "should"; it's up to you. Various implementations will have different performance characteristics, take various amounts of time to implement, understand, and get right. This is typical for software development as a whole, in my opinion.
I would probably first try having a global list of all trie nodes so far created, and representing the child-pointers in each node as a list of indices into the global list. Having a dictionary just to represent the child linking feels too heavy-weight, to me.
Here is a list of python packages that implement Trie:
from collections import defaultdict
_trie = lambda: defaultdict(_trie)
trie = _trie()
for s in ["cat", "bat", "rat", "cam"]:
curr = trie
for c in s:
curr = curr[c]
curr.setdefault("_end")
def word_exist(trie, word):
curr = trie
for w in word:
if w not in curr:
return False
curr = curr[w]
return '_end' in curr
print(word_exist(trie, 'cam'))
Modified from senderle
's method (above). I found that Python's defaultdict
is ideal for creating a trie or a prefix tree.
from collections import defaultdict
class Trie:
"""
Implement a trie with insert, search, and startsWith methods.
"""
def __init__(self):
self.root = defaultdict()
# @param {string} word
# @return {void}
# Inserts a word into the trie.
def insert(self, word):
current = self.root
for letter in word:
current = current.setdefault(letter, {})
current.setdefault("_end")
# @param {string} word
# @return {boolean}
# Returns if the word is in the trie.
def search(self, word):
current = self.root
for letter in word:
if letter not in current:
return False
current = current[letter]
if "_end" in current:
return True
return False
# @param {string} prefix
# @return {boolean}
# Returns if there is any word in the trie
# that starts with the given prefix.
def startsWith(self, prefix):
current = self.root
for letter in prefix:
if letter not in current:
return False
current = current[letter]
return True
# Now test the class
test = Trie()
test.insert('helloworld')
test.insert('ilikeapple')
test.insert('helloz')
print test.search('hello')
print test.startsWith('hello')
print test.search('ilikeapple')
Using defaultdict and reduce function.
Create Trie
from functools import reduce
from collections import defaultdict
T = lambda : defaultdict(T)
trie = T()
reduce(dict.__getitem__,'how',trie)['isEnd'] = True
Trie :
defaultdict(<function __main__.<lambda>()>,
{'h': defaultdict(<function __main__.<lambda>()>,
{'o': defaultdict(<function __main__.<lambda>()>,
{'w': defaultdict(<function __main__.<lambda>()>,
{'isEnd': True})})})})
Search In Trie :
curr = trie
for w in 'how':
if w in curr:
curr = curr[w]
else:
print("Not Found")
break
if curr['isEnd']:
print('Found')
If you want a TRIE implemented as a Python class, here is something I wrote after reading about them:
class Trie:
def __init__(self):
self.__final = False
self.__nodes = {}
def __repr__(self):
return 'Trie<len={}, final={}>'.format(len(self), self.__final)
def __getstate__(self):
return self.__final, self.__nodes
def __setstate__(self, state):
self.__final, self.__nodes = state
def __len__(self):
return len(self.__nodes)
def __bool__(self):
return self.__final
def __contains__(self, array):
try:
return self[array]
except KeyError:
return False
def __iter__(self):
yield self
for node in self.__nodes.values():
yield from node
def __getitem__(self, array):
return self.__get(array, False)
def create(self, array):
self.__get(array, True).__final = True
def read(self):
yield from self.__read([])
def update(self, array):
self[array].__final = True
def delete(self, array):
self[array].__final = False
def prune(self):
for key, value in tuple(self.__nodes.items()):
if not value.prune():
del self.__nodes[key]
if not len(self):
self.delete([])
return self
def __get(self, array, create):
if array:
head, *tail = array
if create and head not in self.__nodes:
self.__nodes[head] = Trie()
return self.__nodes[head].__get(tail, create)
return self
def __read(self, name):
if self.__final:
yield name
for key, value in self.__nodes.items():
yield from value.__read(name + [key])
Trie Data Structure can be used to store data in O(L)
where L is the length of the string so for inserting N strings time complexity would be O(NL)
the string can be searched in O(L)
only same goes for deletion.
Can be clone from https://github.com/Parikshit22/pytrie.git
class Node:
def __init__(self):
self.children = [None]*26
self.isend = False
class trie:
def __init__(self,):
self.__root = Node()
def __len__(self,):
return len(self.search_byprefix(''))
def __str__(self):
ll = self.search_byprefix('')
string = ''
for i in ll:
string+=i
string+='\n'
return string
def chartoint(self,character):
return ord(character)-ord('a')
def remove(self,string):
ptr = self.__root
length = len(string)
for idx in range(length):
i = self.chartoint(string[idx])
if ptr.children[i] is not None:
ptr = ptr.children[i]
else:
raise ValueError("Keyword doesn't exist in trie")
if ptr.isend is not True:
raise ValueError("Keyword doesn't exist in trie")
ptr.isend = False
return
def insert(self,string):
ptr = self.__root
length = len(string)
for idx in range(length):
i = self.chartoint(string[idx])
if ptr.children[i] is not None:
ptr = ptr.children[i]
else:
ptr.children[i] = Node()
ptr = ptr.children[i]
ptr.isend = True
def search(self,string):
ptr = self.__root
length = len(string)
for idx in range(length):
i = self.chartoint(string[idx])
if ptr.children[i] is not None:
ptr = ptr.children[i]
else:
return False
if ptr.isend is not True:
return False
return True
def __getall(self,ptr,key,key_list):
if ptr is None:
key_list.append(key)
return
if ptr.isend==True:
key_list.append(key)
for i in range(26):
if ptr.children[i] is not None:
self.__getall(ptr.children[i],key+chr(ord('a')+i),key_list)
def search_byprefix(self,key):
ptr = self.__root
key_list = []
length = len(key)
for idx in range(length):
i = self.chartoint(key[idx])
if ptr.children[i] is not None:
ptr = ptr.children[i]
else:
return None
self.__getall(ptr,key,key_list)
return key_list
t = trie()
t.insert("shubham")
t.insert("shubhi")
t.insert("minhaj")
t.insert("parikshit")
t.insert("pari")
t.insert("shubh")
t.insert("minakshi")
print(t.search("minhaj"))
print(t.search("shubhk"))
print(t.search_byprefix('m'))
print(len(t))
print(t.remove("minhaj"))
print(t)
True
False
['minakshi', 'minhaj']
7
minakshi
minhajsir
pari
parikshit
shubh
shubham
shubhi
This version is using recursion
import pprint
from collections import deque
pp = pprint.PrettyPrinter(indent=4)
inp = raw_input("Enter a sentence to show as trie\n")
words = inp.split(" ")
trie = {}
def trie_recursion(trie_ds, word):
try:
letter = word.popleft()
out = trie_recursion(trie_ds.get(letter, {}), word)
except IndexError:
# End of the word
return {}
# Dont update if letter already present
if not trie_ds.has_key(letter):
trie_ds[letter] = out
return trie_ds
for word in words:
# Go through each word
trie = trie_recursion(trie, deque(word))
pprint.pprint(trie)
Output:
Coool <algos> python trie.py
Enter a sentence to show as trie
foo bar baz fun
{
'b': {
'a': {
'r': {},
'z': {}
}
},
'f': {
'o': {
'o': {}
},
'u': {
'n': {}
}
}
}
This is much like a previous answer but simpler to read:
def make_trie(words):
trie = {}
for word in words:
head = trie
for char in word:
if char not in head:
head[char] = {}
head = head[char]
head["_end_"] = "_end_"
return trie
class Trie:
head = {}
def add(self,word):
cur = self.head
for ch in word:
if ch not in cur:
cur[ch] = {}
cur = cur[ch]
cur['*'] = True
def search(self,word):
cur = self.head
for ch in word:
if ch not in cur:
return False
cur = cur[ch]
if '*' in cur:
return True
else:
return False
def printf(self):
print (self.head)
dictionary = Trie()
dictionary.add("hi")
#dictionary.add("hello")
#dictionary.add("eye")
#dictionary.add("hey")
print(dictionary.search("hi"))
print(dictionary.search("hello"))
print(dictionary.search("hel"))
print(dictionary.search("he"))
dictionary.printf()
Out
True
False
False
False
{'h': {'i': {'*': True}}}
Have a look at this:
https://github.com/kmike/marisa-trie
Static memory-efficient Trie structures for Python (2.x and 3.x).
String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python dict; the raw lookup speed is comparable; trie also provides fast advanced methods like prefix search.
Based on marisa-trie C++ library.
Here's a blog post from a company using marisa trie successfully:
https://www.repustate.com/blog/sharing-large-data-structure-across-processes-python/
At Repustate, much of our data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our particular case, our dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact for a given HTTP request, 4 or 5 models might be accessed, each doing 20-30 lookups. So the problem we face is how do we keep things fast for the client as well as light as possible for the server.
...
I found this package, marisa tries, which is a Python wrapper around a C++ implementation of a marisa trie. “Marisa” is an acronym for Matching Algorithm with Recursively Implemented StorAge. What’s great about marisa tries is the storage mechanism really shrinks how much memory you need. The author of the Python plugin claimed 50-100X reduction in size – our experience is similar.
What’s great about the marisa trie package is that the underlying trie structure can be written to disk and then read in via a memory mapped object. With a memory mapped marisa trie, all of our requirements are now met. Our server’s memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python’s dictionary implementation.
There are also a couple of pure-python implementations, though unless you're on a restricted platform you'd want to use the C++ backed implementation above for best performance:
Source: Stackoverflow.com