More on Canonical Huffman coding As we have seen canonical Huffman coding allows a faster decompression, and a more efficient use of memory But ... until now we’ve supposed that code lengths are given. In order to prove the viability of canonical Huffman we need a way to compute the code lengths Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006 2 Computing code lengths - I The efficiency of computing the code lengths affects only encoding performances, and encoding is less important than decoding However, when we use whole words as source symbols, it is common to have an alphabet composed of many thousand different symbols (words). In this case efficient solutions can make the difference w.r.t. the use of the traditional Huffman tree Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006 3 The problem Input data a file of n integers, where the i-th integer is the number of times symbol i appears in the text if c is the i-th integer in the file, the probability of i the i-th symbol is ci c n i 1 i ... only small changes if we have directly a file with the probabilities Output data Huffman code lengths Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006 4 The key idea We use a heap data structure It is easy to find the smallest value, that is located at the root There is an elegant and efficient solution to store in memory the binary tree using a simple vector Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006 5 A closer view to the heap - I As we have already seen, the left child of a node in position i is in position 2i, while the right child si stored in location 2i+1. This means that the parent of a node in position k is in location k / 2 What is the depth of the heap if there are n elements? SOL. log n Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006 6 A closer view to the heap - II 2 8 10 8 21 17 11 23 12 20 How this heap is stored in an array? 2 8 10 8 17 23 12 21 11 20 Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006 7 A closer view to the heap - III Removing the smallest item and reorganizing the heap 2 log n cost, 2 comparisons for each level of the tree 2 20 > 8 8 10 8 < 10 20 > 8 8 < 17 8 17 20 > 11 11 < 21 21 11 20 23 12 Construction of the heap It is possible to prove that the cost of constructing a heap from an unsorted list of n items requires about 2n comparisons By the way, how much does it cost, in the worst case, to sort the vector with one of the popular sorting algorithms? SOL. nlogn Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006 9 Computing code lengths - II The algorithm works constructing an initial array with 2n positions the last n positions store the frequencies of the symbols the first half of the array contains a heap in order to efficiently locate the symbol with le lowest frequency As entries are removed from the heap, freed space is used to store branches of the tree Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006 10 Computing code lengths - III Also frequencies are overwritten to store the pointers that constitute the tree At the end the array contains only a oneelement heap and the Huffman tree heap leaves (frequencies) n 1 heap 1 2n leaves & tree pointers h 2n tree pointers h=1 2n Computing code lengths: phase 1 Frequencies are read from the file and stored in the last n positions of the array (let’s call it A) Each position i , 1 i n points to the corrispondent frequency A[n+i] Then the first half of A is ordered in a heap with the method seen before In practice we must ensure that i 1 i n 2 A A i A A 2i A A i A A 2i 1 At the end A[1] stores m1: A[m1]=min{A[n+1...2n]} Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006 12 Computing code lengths: phase 2 h=n while h>1 { m1=A[1] -- take the root of the heap h=h-1 -- position A[h] is no more part of the heap “reorder the heap” m2=A[1] -- now m1,m2 point to the two smallest freq. A[h+1]=A[m1]+A[m2] -- new item is saved in pos. h+1 A[1]=h+1 -- the new element is pushed back in the heap A[m1]=A[m2]=h+1 -- smallest frequencies are discarded and changed into tree pointers “reorder the heap” } Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006 13 example 4 h 1 h+1 1 5 7 4 5 7 m1 m2 m1 9 h+1 1 7 m1 m2 9 1 h 7 m1 m2 Computing code lengths: phase 3 After n-1 iterations a single aggregate remains in the heap, in position 2 and A[1]=2, as this is the only item in the heap To find the deep in the tree of a particular leaf we can simply start from it and follow the pointers until we reach location 2. The number of pointers followed are the desidered length for n 1 i 2n { d=0; r=i; -- d is the counter, r is the current element while r>2 d=d+1; r=A[r]; -- follow the pointer to location 2 A[i]=d } -- now A[i] is the length of codeword i Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006 15 The cost of the algorithm first phase O(n) second phase knlogn a heap of n element is reordered about 2n times each reordering takes log h log n iterations, each of which has a constant cost third phase O(n2) in the worst case there is one iteration for each bit of each codeword how many total bits? uniform distribution n*logn bits... ... worst case: i-th symbol requires i bits to be coded total length i 1 i n n2 2 Revised third phase - I Note that nodes are added to the tree from position h toward position 2 for this reason all pointers of the tree are “from right to left” Then if we start from position 2 towards position 2n, when we find a node we’ve always already found its parent! Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006 17 Revised third phase - II So a very efficient algorithm (O(n)) to find code lengths is to start from position 2 that has length 0 and then to proceed towards position 2n labeling each position with the length associated to its parent augmented by 1 Then third phase becomes A[2]=0 for i=3 to 2n A[i]=A[A[i]]+1 -- A[A[i]] is the parent of A[i] Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006 18