Saturday, 8 August 2015

Trie data structure




  • Trie is a multi-way search tree where each node represents each character and all the children of the same will be having a same prefix.
 


  • Rather than storing the character each node will be storing the prefix.
  • Structure of the Node :
                   String prefix;
Node[26] children;
Each node will hold an array of 26 nodes where each node represents the each    letter. If some letter is not present then that will be empty. It is an overhead of trie because it used extra memory which is not using.


  • Now we have to think about how to organize or store the array of nodes.
     
   Two possibilities are 1) Store as array
                                     2) Store as linked list


If you store as array we can access it in a indexed way,but we have to waste much memory.


if you store as linked list then access time of the search string will increase( search in linked list of 26 nodes is O(26) and we need to travel m levels where m is the length of search string, that sums to O(m*26)) but memory usage will be optimal.


But if we use HashMap then you can get the time complexity O(1) and memory also will be optimal O(n).



  • Compressed Trie


In the above image of the trie, in the case of “inn” word we can see that no other word sharing the latter “i” but still we need to make each node with the prefix. Consider the word “intermediate” where there will be more than 10 node memory has to use without sharing anything. So we can reduce the memory by storing the complete word if there is no other children present for that node. So in the case of compressed trie  the word “inn” will be in a single node.


  • Where do you use trie ?


  • Uses in web search !!
each word will be stored in the trie while crawling and leaf node of the trie will have the detail list of urls where that word is present.
if you want to search ” java and coffee” it will search for both words in trie and get the url list from trie leaf node in O(m) time.Now take the intersection of both urllist.


  • Uses in routers !!
In routers the source and destinations are stored as IP addresses as lookup table where all the computers connected to that router are listed. No when a packet comes router has to route this packet to the relevant destination by searching in the table. It is implemented in trie for search in O(4) [ ip address will be 10.168.3.2 ].


Different types of tries


  • Pattern matching !!


  1. Problem : if you want to search a string from a text, then we want to preprocess the text so that string can be find in O(m) where m is the length of string that is to be searched.
  2. For preprocess the text we will make a trie data structure and store the words in the text in the trie data structure.


Say i have a text of size T to preprocess and make it into in the form of a trie. How can we do this ? Take each word from the text and try adding in the trie. And store the position of the word in the leaf node. It will look like this.


Here the word “tea” presents in the 3rd position in the text, like that we can find out the location of each word in the text. If multiple occurrences are there then there will be multiple locations will be there in the leaf node.



Space complexity of a trie ??


If there is T number of characters are there in the text then total number of space required is O(T) without considering the pointer and node memories.If n words with average size of m then space complexity will be O(n*m). Roughly space complexity will be the total number of characters present in the input text.


Problems comes when :


  1. One word comes as a prefix of another word .eg : “te” and “ted” are two words in the text.
  2. i want to match a pattern “tea ten” which consist of two words, then i have to search two words differently and have to find all the occurrences that is happening consecutively.


Better way to do ??


     Yeah...here comes suffix trie , a variation of trie where we are storing all the suffix of the text that we want to preprocess. Say if we want to process the word “hello” then there will be 5 suffixes in this word (hello,ello,llo,lo,o). Store these occurrences in the trie and this is called as suffix trie. And each leaf node will be having the starting number of the suffix.


   Now we can retrieve each word in the text or pattern in the text as prefix to any of these suffixes that we have added to the trie.Eg : i wanted to search “ell” in the text then
it will match up in the “ello” suffix, then the start location can be found by moving to the leaf node.


Space complexity
At worse case it will take O(n^2) space complexity without compaction. Eg : “Halo” here the suffixes are (Halo,alo,lo,o) , the total number of nodes that are needed is 1+2+3+4 = 10 which is (n* n+1) which comes in the order of O(n^2). So this becomes very big data structure , we can reduce it to O(n) by compaction still we need to store internal characters which makes again costly.


Time complexity for create
Time complexity is also O(n^2) since it has to find out all the possible prefixes.


Time complexity for search


It is O(m+k) where m is the length of the string to be searched and k is the number of occurrences of the string to be searched in the text.










Java implementation of trie data structure


No comments:

Post a Comment