[python] How to create a trie in Python

I'm interested in tries and DAWGs (direct acyclic word graph) and I've been reading a lot about them but I don't understand what should the output trie or DAWG file look like.

  • Should a trie be an object of nested dictionaries? Where each letter is divided in to letters and so on?
  • Would a lookup performed on such a dictionary be fast if there are 100k or 500k entries?
  • How to implement word-blocks consisting of more than one word separated with - or space?
  • How to link prefix or suffix of a word to another part in the structure? (for DAWG)

I want to understand the best output structure in order to figure out how to create and use one.

I would also appreciate what should be the output of a DAWG along with trie.

I do not want to see graphical representations with bubbles linked to each other, I want to know the output object once a set of words are turned into tries or DAWGs.

This question is related to python trie dawg

The answer is


Have a look at this:

https://github.com/kmike/marisa-trie

Static memory-efficient Trie structures for Python (2.x and 3.x).

String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python dict; the raw lookup speed is comparable; trie also provides fast advanced methods like prefix search.

Based on marisa-trie C++ library.

Here's a blog post from a company using marisa trie successfully:
https://www.repustate.com/blog/sharing-large-data-structure-across-processes-python/

At Repustate, much of our data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our particular case, our dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact for a given HTTP request, 4 or 5 models might be accessed, each doing 20-30 lookups. So the problem we face is how do we keep things fast for the client as well as light as possible for the server.

...

I found this package, marisa tries, which is a Python wrapper around a C++ implementation of a marisa trie. “Marisa” is an acronym for Matching Algorithm with Recursively Implemented StorAge. What’s great about marisa tries is the storage mechanism really shrinks how much memory you need. The author of the Python plugin claimed 50-100X reduction in size – our experience is similar.

What’s great about the marisa trie package is that the underlying trie structure can be written to disk and then read in via a memory mapped object. With a memory mapped marisa trie, all of our requirements are now met. Our server’s memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python’s dictionary implementation.

There are also a couple of pure-python implementations, though unless you're on a restricted platform you'd want to use the C++ backed implementation above for best performance:


Modified from senderle's method (above). I found that Python's defaultdict is ideal for creating a trie or a prefix tree.

from collections import defaultdict

class Trie:
    """
    Implement a trie with insert, search, and startsWith methods.
    """
    def __init__(self):
        self.root = defaultdict()

    # @param {string} word
    # @return {void}
    # Inserts a word into the trie.
    def insert(self, word):
        current = self.root
        for letter in word:
            current = current.setdefault(letter, {})
        current.setdefault("_end")

    # @param {string} word
    # @return {boolean}
    # Returns if the word is in the trie.
    def search(self, word):
        current = self.root
        for letter in word:
            if letter not in current:
                return False
            current = current[letter]
        if "_end" in current:
            return True
        return False

    # @param {string} prefix
    # @return {boolean}
    # Returns if there is any word in the trie
    # that starts with the given prefix.
    def startsWith(self, prefix):
        current = self.root
        for letter in prefix:
            if letter not in current:
                return False
            current = current[letter]
        return True

# Now test the class

test = Trie()
test.insert('helloworld')
test.insert('ilikeapple')
test.insert('helloz')

print test.search('hello')
print test.startsWith('hello')
print test.search('ilikeapple')

There's no "should"; it's up to you. Various implementations will have different performance characteristics, take various amounts of time to implement, understand, and get right. This is typical for software development as a whole, in my opinion.

I would probably first try having a global list of all trie nodes so far created, and representing the child-pointers in each node as a list of indices into the global list. Having a dictionary just to represent the child linking feels too heavy-weight, to me.


If you want a TRIE implemented as a Python class, here is something I wrote after reading about them:

class Trie:

    def __init__(self):
        self.__final = False
        self.__nodes = {}

    def __repr__(self):
        return 'Trie<len={}, final={}>'.format(len(self), self.__final)

    def __getstate__(self):
        return self.__final, self.__nodes

    def __setstate__(self, state):
        self.__final, self.__nodes = state

    def __len__(self):
        return len(self.__nodes)

    def __bool__(self):
        return self.__final

    def __contains__(self, array):
        try:
            return self[array]
        except KeyError:
            return False

    def __iter__(self):
        yield self
        for node in self.__nodes.values():
            yield from node

    def __getitem__(self, array):
        return self.__get(array, False)

    def create(self, array):
        self.__get(array, True).__final = True

    def read(self):
        yield from self.__read([])

    def update(self, array):
        self[array].__final = True

    def delete(self, array):
        self[array].__final = False

    def prune(self):
        for key, value in tuple(self.__nodes.items()):
            if not value.prune():
                del self.__nodes[key]
        if not len(self):
            self.delete([])
        return self

    def __get(self, array, create):
        if array:
            head, *tail = array
            if create and head not in self.__nodes:
                self.__nodes[head] = Trie()
            return self.__nodes[head].__get(tail, create)
        return self

    def __read(self, name):
        if self.__final:
            yield name
        for key, value in self.__nodes.items():
            yield from value.__read(name + [key])

Using defaultdict and reduce function.

Create Trie

from functools import reduce
from collections import defaultdict
T = lambda : defaultdict(T)
trie = T()
reduce(dict.__getitem__,'how',trie)['isEnd'] = True

Trie :

defaultdict(<function __main__.<lambda>()>,
            {'h': defaultdict(<function __main__.<lambda>()>,
                         {'o': defaultdict(<function __main__.<lambda>()>,
                                      {'w': defaultdict(<function __main__.<lambda>()>,
                                                   {'isEnd': True})})})})

Search In Trie :

curr = trie
for w in 'how':
    if w in curr:
        curr = curr[w]
    else:
        print("Not Found")
        break
if curr['isEnd']:
    print('Found')

This version is using recursion

import pprint
from collections import deque

pp = pprint.PrettyPrinter(indent=4)

inp = raw_input("Enter a sentence to show as trie\n")
words = inp.split(" ")
trie = {}


def trie_recursion(trie_ds, word):
    try:
        letter = word.popleft()
        out = trie_recursion(trie_ds.get(letter, {}), word)
    except IndexError:
        # End of the word
        return {}

    # Dont update if letter already present
    if not trie_ds.has_key(letter):
        trie_ds[letter] = out

    return trie_ds

for word in words:
    # Go through each word
    trie = trie_recursion(trie, deque(word))

pprint.pprint(trie)

Output:

Coool <algos>  python trie.py
Enter a sentence to show as trie
foo bar baz fun
{
  'b': {
    'a': {
      'r': {},
      'z': {}
    }
  },
  'f': {
    'o': {
      'o': {}
    },
    'u': {
      'n': {}
    }
  }
}

Python Class for Trie


Trie Data Structure can be used to store data in O(L) where L is the length of the string so for inserting N strings time complexity would be O(NL) the string can be searched in O(L) only same goes for deletion.

Can be clone from https://github.com/Parikshit22/pytrie.git

class Node:
    def __init__(self):
        self.children = [None]*26
        self.isend = False
        
class trie:
    def __init__(self,):
        self.__root = Node()
        
    def __len__(self,):
        return len(self.search_byprefix(''))
    
    def __str__(self):
        ll =  self.search_byprefix('')
        string = ''
        for i in ll:
            string+=i
            string+='\n'
        return string
        
    def chartoint(self,character):
        return ord(character)-ord('a')
    
    def remove(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                raise ValueError("Keyword doesn't exist in trie")
        if ptr.isend is not True:
            raise ValueError("Keyword doesn't exist in trie")
        ptr.isend = False
        return
    
    def insert(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                ptr.children[i] = Node()
                ptr = ptr.children[i]
        ptr.isend = True
        
    def search(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                return False
        if ptr.isend is not True:
            return False
        return True
    
    def __getall(self,ptr,key,key_list):
        if ptr is None:
            key_list.append(key)
            return
        if ptr.isend==True:
            key_list.append(key)
        for i in range(26):
            if ptr.children[i]  is not None:
                self.__getall(ptr.children[i],key+chr(ord('a')+i),key_list)
        
    def search_byprefix(self,key):
        ptr = self.__root
        key_list = []
        length = len(key)
        for idx in range(length):
            i = self.chartoint(key[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                return None
        
        self.__getall(ptr,key,key_list)
        return key_list
        

t = trie()
t.insert("shubham")
t.insert("shubhi")
t.insert("minhaj")
t.insert("parikshit")
t.insert("pari")
t.insert("shubh")
t.insert("minakshi")
print(t.search("minhaj"))
print(t.search("shubhk"))
print(t.search_byprefix('m'))
print(len(t))
print(t.remove("minhaj"))
print(t)

Code Oputpt

True
False
['minakshi', 'minhaj']
7
minakshi
minhajsir
pari
parikshit
shubh
shubham
shubhi


class Trie:
    head = {}

    def add(self,word):

        cur = self.head
        for ch in word:
            if ch not in cur:
                cur[ch] = {}
            cur = cur[ch]
        cur['*'] = True

    def search(self,word):
        cur = self.head
        for ch in word:
            if ch not in cur:
                return False
            cur = cur[ch]

        if '*' in cur:
            return True
        else:
            return False
    def printf(self):
        print (self.head)

dictionary = Trie()
dictionary.add("hi")
#dictionary.add("hello")
#dictionary.add("eye")
#dictionary.add("hey")


print(dictionary.search("hi"))
print(dictionary.search("hello"))
print(dictionary.search("hel"))
print(dictionary.search("he"))
dictionary.printf()

Out

True
False
False
False
{'h': {'i': {'*': True}}}

from collections import defaultdict

Define Trie:

_trie = lambda: defaultdict(_trie)

Create Trie:

trie = _trie()
for s in ["cat", "bat", "rat", "cam"]:
    curr = trie
    for c in s:
        curr = curr[c]
    curr.setdefault("_end")

Lookup:

def word_exist(trie, word):
    curr = trie
    for w in word:
        if w not in curr:
            return False
        curr = curr[w]
    return '_end' in curr

Test:

print(word_exist(trie, 'cam'))

This is much like a previous answer but simpler to read:

def make_trie(words):
    trie = {}
    for word in words:
        head = trie
        for char in word:
            if char not in head:
                head[char] = {}
            head = head[char]
        head["_end_"] = "_end_"
    return trie

Here is a list of python packages that implement Trie:

  • marisa-trie - a C++ based implementation.
  • python-trie - a simple pure python implementation.
  • PyTrie - a more advanced pure python implementation.
  • pygtrie - a pure python implementation by Google.
  • datrie - a double array trie implementation based on libdatrie.