More on Canonical Huffman coding


As we have seen canonical Huffman
coding allows a faster decompression,
and a more efficient use of memory
But ... until now we’ve supposed that
code lengths are given. In order to
prove the viability of canonical Huffman
we need a way to compute the code
lengths
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006
2
Computing code lengths - I


The efficiency of computing the code lengths
affects only encoding performances, and
encoding is less important than decoding
However, when we use whole words as source
symbols, it is common to have an alphabet
composed of many thousand different symbols
(words). In this case efficient solutions can
make the difference w.r.t. the use of the
traditional Huffman tree
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006
3
The problem

Input data


a file of n integers, where the i-th integer is the
number of times symbol i appears in the text
if c is the i-th integer in the file, the probability of
i
the i-th symbol is
ci


 c 
n
i 1 i
... only small changes if we have directly a file with
the probabilities
Output data

Huffman code lengths
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006
4
The key idea

We use a heap data structure


It is easy to find the smallest value, that is
located at the root
There is an elegant and efficient solution to
store in memory the binary tree using a
simple vector
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006
5
A closer view to the heap - I


As we have already seen, the left child of a
node in position i is in position 2i, while the
right child si stored in location 2i+1. This
means that the parent of a node in position k
is in location 
k / 2
What is the depth of the heap if there
are n elements?
SOL. log n 
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006
6
A closer view to the heap - II
2
8
10
8
21

17
11
23
12
20
How this heap is stored in an array?
2
8
10
8
17 23 12 21 11 20
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006
7
A closer view to the heap - III

Removing the smallest item and reorganizing
the heap
 2 log n  cost, 2 comparisons for each level of the


tree
2
20 > 8
8
10
8 < 10
20 > 8
8 < 17
8
17
20 > 11
11 < 21
21
11
20
23
12
Construction of the heap


It is possible to prove that the cost of
constructing a heap from an unsorted
list of n items requires about 2n
comparisons
By the way, how much does it cost, in the
worst case, to sort the vector with one of the
popular sorting algorithms?
SOL. nlogn
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006
9
Computing code lengths - II

The algorithm works constructing an
initial array with 2n positions



the last n positions store the frequencies of
the symbols
the first half of the array contains a heap in
order to efficiently locate the symbol with le
lowest frequency
As entries are removed from the heap,
freed space is used to store branches of
the tree
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006
10
Computing code lengths - III


Also frequencies are overwritten to store the
pointers that constitute the tree
At the end the array contains only a oneelement heap and the Huffman tree
heap
leaves (frequencies)
n
1
heap
1
2n
leaves & tree pointers
h
2n
tree pointers
h=1
2n
Computing code lengths: phase 1



Frequencies are read from the file and stored in
the last n positions of the array (let’s call it A)
Each position i , 1  i  n points to the
corrispondent frequency A[n+i]
Then the first half of A is ordered in a heap
with the method seen before

In practice we must ensure that
i 1  i  n 2

A  A i   A  A  2i   A  A i   A  A  2i  1
At the end A[1] stores m1: A[m1]=min{A[n+1...2n]}
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006
12
Computing code lengths: phase 2
h=n
while h>1 {
 m1=A[1]
-- take the root of the heap
 h=h-1
-- position A[h] is no more part of the heap
 “reorder the heap”
 m2=A[1]
-- now m1,m2 point to the two smallest freq.
 A[h+1]=A[m1]+A[m2] -- new item is saved in pos. h+1
 A[1]=h+1 -- the new element is pushed back in the heap
 A[m1]=A[m2]=h+1 -- smallest frequencies are discarded
and changed into tree pointers

“reorder the heap” }
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006
13
example
4
h
1
h+1
1
5
7
4
5
7
m1
m2
m1
9
h+1
1
7
m1
m2
9
1
h
7
m1
m2
Computing code lengths: phase 3


After n-1 iterations a single aggregate remains in the heap, in
position 2 and A[1]=2, as this is the only item in the heap
To find the deep in the tree of a particular leaf we can simply
start from it and follow the pointers until we reach location 2.
The number of pointers followed are the desidered length
for n  1  i  2n {
 d=0; r=i; -- d is the counter, r is the current element
 while r>2
d=d+1; r=A[r]; -- follow the pointer to location 2
 A[i]=d }
-- now A[i] is the length of codeword i
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006
15
The cost of the algorithm


first phase  O(n)
second phase  knlogn



a heap of n element is reordered about 2n times
each reordering takes log h  log n iterations, each
of which has a constant cost
third phase  O(n2) in the worst case


there is one iteration for each bit of each codeword
how many total bits?


uniform distribution  n*logn bits...
... worst case: i-th symbol requires i bits to be coded
total length  i 1 i
n
n2 2
Revised third phase - I

Note that



nodes are added to the tree from position h
toward position 2
for this reason all pointers of the tree are
“from right to left”
Then if we start from position 2 towards
position 2n, when we find a node we’ve
always already found its parent!
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006
17
Revised third phase - II


So a very efficient algorithm (O(n)) to find code
lengths is to start from position 2 that has
length 0 and then to proceed towards position
2n labeling each position with the length
associated to its parent augmented by 1
Then third phase becomes
A[2]=0
for i=3 to 2n

A[i]=A[A[i]]+1
-- A[A[i]] is the parent of A[i]
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006
18