Lempel-Ziv methods
Dictionary models - I



Dictionary-based compression methods use
the principle of replacing substrings in a
message with a codeword that identifies each
substring in a dictionary, or codebook
The dictionary contains a list of substrings and
their associated codewords
Unlike symbolwise methods, dictionary
methods often use fixed codewords rather
than explicit probability distribution
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
2
Dictionary models - II





For example, we can insert into the dictionary
the full set of 8-bit ASCII characters How many?
and the 256 most common pairs of characters
If we use fixed length codeword, how many
bits does we need to index dictionary entries?
SOL. 9 bits
What about the performances in bits/character
in the best and in the worst case?
SOL. best:4.5b/char worst:9b/char!!
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
3
Dictionary models - III



Another possibility is to use longer words in
the dictionary, perhaps common words like the
or and or common components of words like
tion. This strings are the phrases of the
dictionary
A dictionary with a predefined set of phrases
does not achieve good compression
Performances are better if we tune the
dictionary on input source, i.e. if we loose
input indipendence
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
4
Dictionary models - IV



For istance common phrases for an italian
sport newspaper are very rare in a business
management book
To avoid the problem of dictionary being
unsuitable for the text at hand we can build a
new dictionary for each message to be
compressed....
.... but there is a significant overhead for
transmitting and storing it

Deciding the size of the dictionary in order to
maximize compression is a very difficult problem
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
5
The Lempel-Ziv methods


The only efficient solution to the problem is to
use an adaptive dictionary scheme
Pratically all adaptive dictionary compression
methods are based on one of the two methods
developed by two israely researchers,
Abraham Lempel and Jacob Ziv in 1977 e
1978, and called LZ77 and LZ78

"A Universal Algorithm for Sequential Data Compression"
in the IEEE Transactions on Information Theory, May 1977
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
6
The key idea - I

The key insight of the method is that it is
possible to automatically build a dictionary of
previously seen strings in the text being
compressed

The prior text makes a very good dictionary, since it
has usually the same style and language of the
upcoming text
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
7
The key idea - II

The dictionary does not have to be transmitted
with the compressed text, since the
decompressor can build it the same way the
compressor does


The many variants of Lempel-Ziv methods differ in
how pointers are represented and in the limitations
on what the pointers are able to refer to
The presence of so many variants is also caused by
same patents, and by the disputes over patenting
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
8
The LZ77 family
+ Quite easy to implement
+ Fast decoding with little use of memory
 The output of the encoding consists of a series
of triples
 the
the
 the
 the
first component indicates how far back to look in
previously decoded text
second component is the length of the phrase
third is next character for the input
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
9
An example - encoding

alphabet {a,b}
ab a a b a b aabb
<0,0,a> <0,0,b> <2,1,a> <3,2,b> <5,3,b>
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
10
An example - decoding

<0,0,x><0,0,y><2,1,z><2,1,x><5,3,z>
<6,3,z><5,2,z>
SOL. x y xz xx yxzz xxyz zxz
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
11
A recursive example

<0,0,a><0,0,c><2,1,a><4,2,b><1,10,a>
a

c
aa
acb ?? bbbbbbbbbba
Despite the recursive references, each
character is available when needed
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
12
Further details on LZ77


LZ77 algorithm places limitations on how far
back a pointer can refer (i.e. on the length of
the first component of the triple) and on the
maximum size of the string referred to (i.e. on
the length of the second component)
For example, in English text there is no gain in
using a sliding windows of more than a few
thousand characters

We can use a windows of 8.192 characters, i.e. 13
bits
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
13
Further details on LZ77


At the same time, the length of the match is rarely
over 16 characters, so the extra cost to allow
longer match usually is not justified
Exercise: encode the sequence
010020$0110$$0111
with a sliding window of 7 symbols and a maximal
match length of 3. Calculate the compression ratio
SOL.
<0,0,0><0,0,1><2,1,0><0,0,2><2,1,$><7,2,1><5,2,$>,<6,3,1>
0000000 0000001 0100100 0000010 0100111 1111001 1011011 1101101
C=(17*2)/7*8=0.607 << 1!!!
14
LZ77 - encoding
Encode the text S[1..N] using LZ77, with a
sliding window of W characters
p=1
WHILE p<N {




search for the longest match for S[p...] in S[p-W ... p-1].
Suppose the match occurs at position p-m, with length l
output the triple < p-m, l, S[p+l ] >
p=p + l + 1 }
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
15
LZ77 - decoding
Decode the text S[1..N] using LZ77, with a
sliding window of W characters
p=1
FOREACH triple < f, l, c> {




S[p ... p + l - 1] = S[ p - f ... p – f + l - 1]
Suppose the match occurs at position p-m, with length l
S[p+l ] = c
p=p + l + 1 }
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
16
LZ77 - improvements

The LZ77 has been gradually refined



first component of the triple: it is useful to use
variable length, assigning shorter codewords to
recent matches (that are more common)
second component of the triple: variable length
codes that uses less bits to represent smaller
numbers
third component of the triple: in some variants it
is added only when needed (when?), with a 1-bit flag
to indicate the presence or the absence of this third
component
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
17
gzip algorithm - I



gzip is one of the more effective
variants of LZ77
It is distributed by the Gnu Free
Software Foundation
home page of gzip project:
www.gzip.org
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
18
gzip algorithm - II




gzip uses a hash tables: the next 3 characters
to be coded are hashed, and the return value
is used as index to lookup a table entry
This entry is the head of a list that contains
the places of occurrence of the 3 characters in
the window
The list is searched for the longest match
If there is no match the string is coded as raw
characters
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
19
gzip algorithm - III



If the match exists, we have a length and a
distance, otherwise we have a zero length and
a raw character
the sliding window has dimension W=32KB,
lengths are limited to 258 bytes
List lengths are limited to avoid time
consuming researches

tradeoff accuracy/time  user’s choice
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
20
gzip algorithm - IV


Lengths, distances and raw characters are
coded with two Huffman trees, one for
distances and the other for lengths and raw
characters
Huffman codes are generated processing
blocks of up to 64KB (with canonical Huffman)

so gzip it is not really one-pass. From a pratical point
of view it is one-pass because blocks are small, so
they are read only one time and kept in main
memory
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
21
gzip - example
abacbcaab
<length, distance/character>
<0,a><0,b><1,2><0,c><1,3><1,2><1,4><2,7>
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
22
gzip – best compression
abacbcaab ...... abcaab
<length, distance/character>
<0,a><0,b><1,2><0,c><1,3><1,2><1,4><2,7>
two solutions


ab + caab

a + bcaab
The first is greedy, it uses longest possible match. But
sometimes the second is better. If best compression is
selected, gzip takes a longer time but chooses the best of
the two, eventually coding raw characters even if matches
are possible, if it gives better compression in the long run
23
LZ78 family
- it has restrictions on which substring can be
referenced (but this avoids some inefficiency)
- decoding is slower than LZ77 and require
more memory
+ does not have a window to limit how far
back substring can be referenced
+ one of its variant, LZW, is widely used in
many popular compression systems
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
24
Referentiable strings



The text prior to the current coding position is
parsed in substrings, and only parsed phrases
can be referenced
Previous phrases are numbered in sequence,
and the output is a list of pairs
<previous phrase, next character>
This unseen combination is stored as a new
phrase
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
25
An example
abaababaa
Phrases
Output
0
1
2
3
4
5
<0,a> <0,b> <1,a> <2,a> <4,a>
<null>
a
b
aa
ba
baa
Only this phrases can be referenced

This avoids the inefficiency of having more than one coded
representation for the same string, as usual in LZ77
26
How to store the phrases


It is crucial for algorithm efficiency, to store
the phrases in a clever way
This can be obtained using a trie
a
Phrases
0
1
2
3
4
5
<null>
a
b
aa
ba
baa
0
b
1
2
a
a
3
4
a
5
b
6
27
How to store the phrases
a

0
b
1
2
a

a
3
4
a
5
b
7
baab
<5,b>
b
6


The character of each
phrase specify a path from
the root to a leaf
The character to be
encoded are used to
traverse the trie until the
path is blocked
The last node contains the
phrase number to output
A new node is added with
next input character, to
form a new phrase
28
A problem


The trie data structure continues to grow
during coding, and eventually growth must be
stopped to avoid an eccessive use of memory
There are various strategies



the trie can be reinitialized from scratch
the trie can be used as is, without further updates
the trie can be partially rebuild using last part of the
text (this avoids the penalties of starting form
scratch)
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
29
LZ78 vs. LZ77


LZ78 encoding can be faster
LZ78 decoding is slower because the
decoder must also store the parsed
phrases
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
30
LZ78 - exercise


Code the sequence with LZ78 e show the trie
that store the phrases
0100101110101001011
SOL.
<0,0><0,1><1,0><2,0><2,1><4,1><1,1><3,1><7,1>
0
1
2
3
4
5
6
7
<null>
0
1
00
10
11
101
01
0
0
PHRASES
8
9
001
011
0
3
1
8
1
1
1
0
7
1
9
4
2
1
5
1
6
31
LZW variant - I



One of the most popular variants of LempelZiv coding (Welch 1984)
It forms the basis for Unix utility compress
and many other popular compressors
The main difference between LZW and LZ78 is
that LZW encodes only phrase numbers
without any ending characters

This scheme works fine because we initialize the
dictionary with a phrase for each character of the
source alphabet (e.g. the 256 characters of the 8-bit
ASCII)
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
32
LZW variant - II


A new phrase is constructed from a coded one
by appending the first character of the next
phrase
Suppose we use 7-bit ASCII: dictionary is
initialized with 128 phrases (0-127)
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
33
LZW - encoding
input:
a
b
a
ab
ab
ba
aba
abaa
output:
97
98
97
128
128
129
131
134
128 129 130
ab ba
aa
131
aba
132
abb
133
baa
134
abaa
new
phrases
added
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
34
LZW - decoding
input:
97
98
97
128
128
129
131
134
output:
a
b
a
ab
ab
ba
aba
aba?
abaa
it is not
ready!!
new
phrases
added
128 129 130
131
132
ab ba
a?
b? aa
a? aba
ab? abb
ab?
133
134
?
ba? abaa
baa
aba? abaa?
 The delay in phrase creation is not a problem unless the encoder
uses the phrase immediately after its creation. In this case, if
decoder inserts the phrases only when they are completed,
cannot decode, because it doesn’t have phrase 134
35
LZW - exercise

Code with LZW the sequence
0102$00$10111$02$
using 8-bit ASCII codes
Hint. 048, 149, 250, $36

SOL. 48 49 48 50 36 48 48 36 257 49 265 260 259
PHRASES
256
257
258
259
260
01
10
02
2$
$0
261
262
263
264
265
266
00
0$
$1
101
11
11$
267
268
$02
2$?
36
Lempel-Ziv methods: summary
LZ77
gzip
<pointer,length,character>
<length, distance/character>
LZ78
LZW
<phrase,character>
<phrase>
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
37
Lempel-Ziv methods: exercise

Code the message using all studied methods
abbb010cac0bb0abb10111b1a
using a sliding window of 7 bits and a max
match length of 7. No limit is given with
respect to the dictionary dimension
You can use the following 8-bit ASCII codes
a97 b98 c99 048 149
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005
38
Scarica

Lempel-Ziv methods