Compressed Trie Tree

Hello, people! In this post, we will discuss a commonly used data structure to store strings, the Compress Trie Tree, also known as Radix Tree or Patricia (Practical Algorithm to Retrieve Information Coded in Alphanumeric) Tree. If you remember, the problem with a Trie Tree is that it consumes a lot of memory. So, due to its memory consumption, developers prefer using a compressed trie tree to get the same functionality at the same runtime complexity in memory-critical situations such as in android apps.

“Compress”-ing the tree

Now, the idea of the compressed trie tree is to convert long chains of single-child edges to one single edge. Example, suppose we have two words “facebook” and “facepalm” in our regular trie tree, it would look like –

Trie Tree

It’s a pretty long one, isn’t it? Now, how about this one below?

compressed trie tree

This one surely looks a lot compact! This is the compressed trie tree. As you can see what we do here is to make long chains of single-child edges into one edge. Unlike a regular trie tree, in a compressed trie tree, the edge labels are strings.

Node in Compressed Trie Tree

For a regular trie tree, our tree node looked something like this –

class Node {
    Node[] children = new Node[26];
    boolean isWordEnd;
}

So, in every node, we had an array of references. The first references corresponded to ‘a’, the second to ‘b’, and so on. So, essentially, we had a mapping of alphabets to references. We had a way of saying, “An edge ‘a’ is denoted by this particular element in the array of references”. Now, in a compressed trie tree, the edges are labeled by strings. Now, we need a way of saying, “An edge ‘face’ is denoted by this particular element in the array of references”. To accomplish this, we re-design our tree node as follows –

class Node {
    Node[] children = new Node[26];
    StringBuilder[] edgeLabel = new StringBuilder[26];
    boolean isEnd;
}

So, what we did is that we added an additional array of Strings along with the array of references. Now edgeLabel[0] will denote the string starting with ‘a’, edgeLabel[1] will denote the string starting with ‘b’, and correspondingly, children[0] will denote the edge with the label edgeLabel[0].

Example, in the above diagram, the root node will have edgeLabel[5] = “face” and children[5] will point to the internal node. The internal node will have edgeLabel[1] = “book” and children[1] will point to the leaf node which will denote the occurrence of the word “facebook”. The same internal node will also have edgeLabel[15] = “palm” and children[15] will point to the leaf node which will denote the occurrence of the word “facepalm”. The rest of the values of edgeLabel and children in the internal node will be null.

The above code is written in Java. For Java, it is much better to use StringBuilder rather than String because we would be doing a lot of string manipulation. Using String will heavily slow down your program. If you are not familiar with StringBuilder, you can refer to my post.

insert() operation

All operations in the compressed trie tree are similar to what we would do in a regular trie tree. Insert operation is the one which will differ the most. In the insert operation, we need to take care of a few cases, they are –

  1. Inserting new node for a new word – This occurs when the starting character of the word is new and there’s no edge corresponding to it. This may occur at root, or after traversing to an internal node.compressed-trie-tree-4
  2. Inserting a prefix of an existing word – Inserting prefix into compressed trie tree
  3. Inserting a word which has a partial match to an existing edge – This occurs when we are trying to insert “this” when “there” is already inserted into the tree. Remember that “there” can have further children too, like if “thereafter” and “therein” are already inserted.breaking words during compressed trie tree insertion

So, for these cases, we would have to break the existing word or the newly inserted word accordingly. The faster we perform these string operations, the faster the insert operation will be.

search() operation

Searching in a compressed trie tree is much like searching. Here, instead of comparing a single character, we compare strings. The following cases will arise –

  • The string we want to search does not exist. Example, searching for “owl” when the tree only has “crow” in it.
  • The string we want to search exists as a prefix. Example, searching for “face” when the tree only has “facebook”.
  • Only the prefix of the target string exists. Converse of the previous case. Searching for “facebook” when the tree only has “face”.
  • The string we want to search matches partially with an existing string. Example, searching for “this” where the tree only has “there”.
  • Another case is when the edge label fully matches to the starting portion of the string. Example, searching for “thereafter” when “thereafter” and “therein” exist in the tree. For this, after a full match with “there”, we traverse to the node which corresponds to that label and then resume our search (searching for “after”).

If we are able to fully traverse the tree via the target string and arrive on a tree node, we check if that node is a word ending or not. If yes, we return true or, we return false. For rest of the cases, return false.

startsWith() operation

The startsWith() operation is a popular operation performed on the compressed trie tree which checks if there’s any word in the tree which starts with the target word. This method would be exactly as the searching method. The minor change with the search operation would be, in this operation, we will just check if we are able to fully traverse the tree via the target string and arrive on a node (which may be the root). If we can we return true, regardless of whether the current node is a word ending or not. This is because, even if it is not a word ending, its children will lead to a node which would be a word ending.

Printing the compressed trie tree

For each edge traversed, add its label to a variable and recursively do this for the child node. While traversing another edge, remove the previously added label and add the new label of the new edge traversing. Print the variable only if the current node is a word ending.

This recursive method should print all the words in the compressed trie tree in a lexicographical order.

Code

Start with your existing trie tree code, modify the node definition and then work on the insert method cases. Once you get the insert correctly, then the rest will work out easily. For the insert cases, you just have to do the string operations and change the references carefully. Try to code those cases. Come back and have a look at the diagrams if you need to.

You can check your code’s correctness with LeetCode’s Implement Trie problem. Try to solve that question using a compressed trie tree. Once you solve it, try to reduce the runtime of your program.

You can refer to my code if you get stuck. 🙂

    

This is the Java implementation. I will update this post with the C/C++ implementation soon.

In terms of runtime complexity, compressed trie tree is same as that of a regular trie tree. In terms of memory, a compressed trie tree uses very few amount of nodes which gives you a huge memory advantage especially for long strings with long common prefixes. In terms of speed, a regular trie tree would be slightly faster because its operations don’t involve any string operations, they are simple loops.

I hope my post has demystified everything regarding a compressed trie tree. Tutorial and code for a compressed trie tree are hard to find. I hope my post saved you the effort of finding further tutorials. Do comment below if you have any doubts. Keep practising! Happy Coding!! 😀

Trie Tree Implementation

Hello people…! In this post we will talk about the Trie Tree Implementation. Trie Trees are are used to search for all occurrences of a word in a given text very quickly. To be precise, if the length of the word is “L“, the trie tree searches for all occurrences of this data structure in O(L) time, which is very very fast in comparison to many pattern matching algorithms.

But I must mention, this data structure is not exactly used for “pattern matching”, it is used to search for the occurrences of the word in the given text. How these both functionalities differ…? We’ll get to know that shortly. The Trie Tree has many applications. Your browser could be using a trie tree internally to search for words when you press Ctrl + F. So, let’s get started with this data structure…!

The Trie Tree is a very straight forward data structure. It is a simple tree where the nodes have an alphabet associated with them. And the way these nodes are arranged is the best part of the Trie Tree. To get an idea take a close look at the sketch below –

Structure of Trie Tree

Structure of Trie Tree

The arrows in the sketch above indicate how to traverse the trie tree in order to tell if a word exists or not. We travel through the root node, down the tree until we hit a leaf. Picking up a character at every edge, we construct our word. Now, you can easily tell why the time to search for a node in the text will be in the order of length of the word to be searched. It’s obvious…! One would have to go down till the leaf. In the worst case, one would have to travel the height of the tree which would be the length of the longest word in the whole text..! 😉

So, as we got a little idea about working with a Trie Tree, let us move on to the serious part. Let us talk about how this is implemented and then we will talk about three fundamental operations done on the Trie Tree –

  • Insert
  • Delete
  • Search

In C, the Trie Tree is implemented using structures. But the C implementation of this data structure gets a little dirty. So we will switch to C++ but we will keep it as C-ish as possible. As we keep discussing about the implementation, you will notice how many advantages we have when we use C++. This will be our structure of each node in the tree –

struct node
{
    struct node * parent;
    struct node * children[ALPHABETS];
    vector<int> occurrences;
};
  • parent – This points to the node which is the parent of the current node in the tree hierarchy. It may seem useless at the first, but we need this to travel up the tree when we want to delete any word. We’ll get to the deletion operation in a moment.
  • children – It points to the children nodes. It is made an array so that we can have O(1) accessing time. But why is the size 26…? Its simple. Consider a word, “th”, now, what could the third letter possibly be, if it had one…? One among the 26 english alphabets…! So that’s why the size is made 26. If you want to make a trie tree for another language, replace the number 26 with the number of letters in your language.
  • occurrences – This will store the starting indices of the word occurring in the given text. Now, why this is made a vector is that, vector is as good as a Linked List with random access. It is one of the most handy ready made data structures available in the C++ STL Library. If you are not familiar with vectors this is a good place to start.
    If this were C, we would have to give a fixed array size, and we would have a problem if the occurrences of a particular node are more. We could avoid this by putting a Linked List there. But we sacrifice random access and a whole lot of operations get time taking. Moreover, the code will get really really cumbersome to manage if you have a tree and a linked list.

Having got a picture of the implementation, let us look at how the operations are done in a Trie Tree.

Insert Operation

When we do an insert operation, there are a few cases –

  1. The word to be inserted does not exist.
  2. The word to be inserted already exists in the tree.
  3. The word to be inserted does not exists, but as the suffix of a word.

The first case is simple. One would have to traverse till the alphabets of the words have nodes in the trie tree or else create new nodes one-after-the-other. And at the end of the word, i.e., the node for the last alphabet, we will mark it as a leaf and push the starting index into the vector indicating the occurrence of the newly inserted word.

During this course of traversal, we will be cutting off the string of the word we have one-by-one as they are processed. This is done by putting using a vector of characters and popping off one character after the other. This is less code to handle and more efficient as we can use a vector as a queue. This is another advantage of using C++.

After having learnt what to do with the first case, you can guess what we would have to do in the second case. We simply have to add a new value to the occurrences vector at the node corresponding to the last alphabet of the word. We can also know the number of occurrences in constant time, we simply return the size of the vector. This is another advantage of using C++.

To understand the challenge in the third case, let’s take a simple example. What would you do with your trie tree if you wanted to insert the word “face” if the word “facebook” is already there in your tree…? This is the third case. The answer to this is the occurrence vector itself. We simply push the starting index of the word into the vector of that node which corresponds to the last alphabet of the word to be inserted, in the above example, this would be the node of “e”. So, what really tells if there’s a word ending with the alphabet corresponding to the current node is the size of the vector.

So I hope you understand how important our vector is. A lot depends on it…!

Delete Operation

The deletion of a word in the trie tree is similar to the insertion, we have a few cases –

  • Error 404 : Word not found…!
  • Word exists as a stand-alone word.
  • Word exists as a prefix of another word.

If the word is not there at all, we don’t have to do anything. We just have to make sure that we don’t mess up the existing data structure…!

The second case is a little tricky. We would have to delete the word bottom-up. That is, we will delete that part of the word which is not a part of any other word. For example, consider the sketch above. If we were to delete “this”, we would delete the letters ‘i’ and ‘s’ as, ‘h’ is a part of another word. This keeps us away from distorting the data structure. If the word were existing multiple number of times we will simply remove the occurrence from the vector of the concerned node.

In the third case too, we will simply delete the occurrence of the word from the vector. We needn’t write a lot of code as we can use the functions in algorithm header file. This is another advantage of using C++.

Note – When we delete the occurrence of a word, we are not concerned about the validity of the indices stored as occurrences of other words. What I mean to say is, suppose we have 10 words. If we delete the 3rd word, the 5th word or the 9th word is supposed to become the 4rth and the 8th word as far as the original text is concerned. But we will not consider this. The data stored in the trie tree is not meant to me deleted or inserted. The Trie Tree is meant for processing the given text not to manipulate the given text.

Search Operation

The search operation is simple and is as we discussed when we began our discussion about the Trie Tree. We go down the tree traversing the nodes and keep “picking up characters” as we go. And the occurrences vector tells us if a word exists that ends with the alphabet associated with the current node, and if so, it gives us the indices of occurrences and also the number of occurrences.

Besides these basic operations, there is another very interesting operation that is done with the Trie Tree –

  • Lexicographical Sort – If we want to print all the words processed into the trie tree lexicographically, all we have to do is do a Preorder Walk on the tree. This will automatically print all the words in the lexicographical order or the dictionary order. This is due to the very structure and arrangement of nodes in a Trie Tree. Now, I’d like to share another interesting thing about pre-order walk in trees… The Pre-order walk works exactly as a Depth First Search (DFS) in graphs, i.e., the sequence in which both the algorithms visit the nodes is the same. Think about this for a while and word out the two algorithms on an example (you could take mine in the sketch above), and you can see why it is so. You will also notice why the printed words would be lexicographically sorted.

Now, having learned a lot about the trie tree, try coding it in C++. If you are uneasy with C++, you can try it in C, but make sure you try at least 3 times. Trying is very important. I don’t know if you are new to reading my posts, but I insist a lot on trying in every post of mine…! If you have succeeded, you’re fabulous…! If not, check out my code below any try figuring out how just close you were…!!

C++Java
    
    

The code is highly commented with explanation. It is well described, but if you have any doubts regarding the data structure or the code, feel free to comment them. I have used a few macros. The macro CASE indicates for which case the Trie Tree works. If we mention ‘A’ as the macro, the Trie Tree will work for upper case words only.

Other Implementations

The code is well tested against fairly large input. You can download the test case file here – Trie Tree Input (PDF). You can just clock your code for the insert operations. My code took 1.236 seconds to execute that loop which reads a word and inserts it into the Trie Tree. There are 5000 words in total. The last word in the input file is the word to be deleted.

If you think you can give a suggestion to make my code better, please do comment them too. I appreciate your suggestions. For those there who are struggling to get their code right, “Keep the effort going guys”…! Remember, you won’t learn anything if you keep writing Hello World program again and again. So, keep practising…! I hope my post has helped you in getting to know about the Trie Tree. If it did, let me know by commenting! Happy Coding…! 😀