From: Nick Yao Zhen
Subject: An Optimized Implementation of LZW Algorithm
Date: Tue, 15 Feb 2000 18:27:45 +0000
Newsgroups: comp.compression,sci.math,sci.physics

+================================================+
|  An Optimized Implementation of LZW Algorithm  |
+------------------------------------------------+
| Author : Nick Yao Zhen                         |
| Date   : 02/14/2000 (mm/dd/yyyy)               |
+================================================+

-------Notes---------

This document gives a brief overview of the LZW algorithm, so that
readers can fully understand the ENCODING process, and then discusses
a method developed independently by the author which can improve the
speed of the encoding algorithm.

The method the author developed may well have been invented by others
already. The author fully appreciates the work done by those
inventors.

FREELY DISTRIBUTABLE AS LONG AS REPRODUCED COMPLETELY.
Copyright 2000 Nick Yao Zhen
.....................................................................

------------------
[1] INTRODUCTION
------------------

1.1 The LZW Algorithm

The LZW scheme is named after its inventors: Abraham Lempel, Jacob
Ziv and Terry Welch. It works by entering phrases into a dictionary;
when a repeat occurrence of a particular phrase is found, it outputs
the dictionary index instead of the phrase itself.

LZW is patented in the USA (patent number 4,558,302). The patent is
held by Unisys Corporation, and it covers, among other things,
CompuServe's GIF format. You may get a limited licence by writing to
the following address:

   Welch Licencing Department
   Office of the General Counsel
   M/S C1SW19
   Unisys Corporation
   Blue Bell
   Pennsylvania, 19424, U.S.A.

LZW has been the most well-known Lempel-Ziv data compression scheme,
and it was also adopted by the GIF graphics format. Here is a simple
example to illustrate how the algorithm works.

Note: LZW starts with a 4K dictionary whose entries 0-255 refer to
the ASCII character set. Normally it has two more entries: 256
indicates reinitialization of the dictionary and 257 represents the
End-Of-Information signal. Each time a new phrase is encountered,
LZW appends the phrase to the dictionary. (In most cases entries
258-4095 are used for new phrases.)

Source string: /WED/WE/WEE/WEB

   Character Input   Output   Dictionary
   /W                /        258 = /W
   E                 W        259 = WE
   D                 E        260 = ED
   /                 D        261 = D/
   WE                258      262 = /WE
   /                 E        263 = E/
   WEE               262      264 = /WEE
   /W                263      265 = E/W
   EB                259      266 = WEB
   <END>             B
                     257

The LZW encoding algorithm can be described as follows:

   set w = NIL
   loop
       read a character K
       if wK exists in the dictionary
           w = wK
       else
           output the code for w
           add wK to the dictionary
           w = K
   until end of information

You may notice a problem here: how can we represent 256, or any
number above it, within one byte? If we simply used two bytes, then
for two-byte-long phrases the LZW algorithm would not compress at
all! The common solution is that when the number of entries in the
dictionary reaches 2^N, the algorithm automatically uses N+1 bits to
output index numbers; e.g. when it reaches 512, it uses 10 bits.
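To make the loop concrete, here is a minimal sketch in C of the
encoder above, assuming the standard 256-entry initial dictionary and
the plain linear search that the rest of this document sets out to
improve. All names are the sketch's own, phrases are stored as plain
strings, and the bit-packing is only simulated (each code is printed
together with its current width), so treat it as an illustration
rather than a finished encoder.

   #include <stdio.h>
   #include <string.h>

   #define MAX_ENTRIES 4096
   #define FIRST_CODE  258            /* 256 = reinit, 257 = EOI */

   static char dict[MAX_ENTRIES][64]; /* phrases, searched linearly */
   static int  n_entries;

   static int find(const char *s)     /* the one-by-one search */
   {
       for (int i = 0; i < n_entries; i++)
           if (strcmp(dict[i], s) == 0)
               return i;
       return -1;
   }

   static int width(void)             /* N+1 bits once entries hit 2^N */
   {
       int bits = 9;
       long limit = 512;
       while (n_entries >= limit) { bits++; limit <<= 1; }
       return bits;
   }

   static void emit(int code)         /* stand-in for real bit output */
   {
       if (code >= 0 && code < 256)
           printf("'%c' in %d bits\n", code, width());
       else
           printf("%d in %d bits\n", code, width());
   }

   int main(void)
   {
       const char *src = "/WED/WE/WEE/WEB";
       char w[64] = "", wK[64];

       /* Initial dictionary: one entry per byte value.  (Entry 0,
          the NUL character, cannot live in a C string; it is never
          looked up in this demo, so that is harmless here.) */
       for (int i = 0; i < 256; i++) {
           dict[i][0] = (char)i;
           dict[i][1] = '\0';
       }
       n_entries = FIRST_CODE;        /* skip the two special codes */

       for (const char *p = src; *p; p++) {
           snprintf(wK, sizeof wK, "%s%c", w, *p);
           if (find(wK) >= 0) {
               strcpy(w, wK);         /* wK exists: w = wK */
           } else {
               emit(find(w));         /* output the code for w */
               if (n_entries < MAX_ENTRIES)
                   strcpy(dict[n_entries++], wK);  /* add wK */
               w[0] = *p; w[1] = '\0';             /* w = K */
           }
       }
       if (w[0]) emit(find(w));       /* flush the last phrase */
       emit(257);                     /* End-Of-Information */
       return 0;
   }

Running it on the source string above prints exactly the Output
column of the table. Note how every probe of the dictionary goes
through find(): that linear search is the cost the following
sections attack.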
1.2 Dictionary

Because LZW is a dictionary-based data compression scheme, the
dictionary operation is essential to the performance of the
algorithm. "Performance" here normally refers to:

   # Processing speed
   # Compression ratio

The first thing you may wonder about is why the standard LZW scheme
SHOULD have an initial dictionary of 256 or 258 entries. Assume the
algorithm begins with NO initial dictionary and is given a source
string of "ABABAB...". The output result is astonishing.

Source string: ABABABABAB.....

   Character Input   Output   Dictionary
   A                 0        0 = A
   B                 1        1 = AB
   A                 2        2 = BA
   BA                3        3 = ABA
   BAB               4        4 = ABAB
   AB                5        5 = BAB
   ABA               6        6 = BABA
   BABA              7        7 = ABABA
   ...

While the number of entries is in the range 0-1 it takes only 1 bit
to encode an index; in the range 2-3 it takes 2 bits; in the range
4-7, 3 bits. So altogether "ABABABABABABABABA", a 17-byte-long
string, can be encoded into 18 bits. Obviously the algorithm with no
initial dictionary can achieve a better compression ratio.

Another disadvantage of the initial dictionary is that, in
compressing text files, almost half of the dictionary entries will
never be used. As a result, you are actually wasting one unnecessary
bit every time you output an index number.

There must be some reason for the standard LZW scheme to have that
initial dictionary, and that reason is SPEED. Every time a new byte
is read, LZW has to perform a search in the dictionary, so most of
the processing time is consumed by the searching. Without an
ASCII-ordered initial dictionary we would have to search for every
SINGLE byte as well.

1.3 Improving the dictionary operation

The most important thing we can do to accelerate the LZW encoding
procedure is to make the dictionary search as quick as possible. In
David Bourgin's document "Introduction to the losslessy compression
schemes - Description of the source files and the methods", he
raises three possible ways to optimize the dictionary operation:

 * Limit the dictionary to a high number of words.
   A 4K dictionary will enable you to process a stream of
   7,370,880 bytes.

 * Use a dictionary of fewer than 258 entries if possible.
   In processing 16-color pictures without half-byte packing this is
   a good plan. It may also work with text files.

 * Reinitialize the dictionary with high-frequency entries only.
   This is a good strategy, especially for graphics files.

However, in all three cases we cannot avoid a linear list search,
which means we have to examine the phrases ONE BY ONE. The next part
of this document gives a new way to organize the dictionary, which
could give the LZW encoding scheme better speed performance.

--------------------------------
[2] A Tree Organized Dictionary
--------------------------------

Before introducing the algorithm, the reader should have a good
knowledge of data structures. In this algorithm a tree (NOT a binary
tree) is used to represent the whole dictionary.

We mentioned the basic encoding algorithm in the INTRODUCTION part.
(If readers don't mind, I will paste it again.)

   set w = NIL
   loop
       read a character K
       if wK exists in the dictionary
           w = wK
       else
           output the code for w
           add wK to the dictionary
           w = K
   until end of information

You should notice that every item added to the dictionary MUST have
a prefix which is ALREADY in the dictionary (w is the prefix). That
means that if "ABCD" exists in a dictionary, then "A", "AB" and
"ABC" MUST also exist (though not vice versa). So the structure of
"ABCD" is something like:

   A-B-C-D

Assume we have an alphabet set I = {'A', 'B', 'C', 'D'} and a
dictionary of:

   Index   Phrase
   0       A
   1       B
   2       C
   3       D
   4       AA
   5       AAB
   6       BA
   7       BAB
   8       AAC
   9       AB
   10      CD

   [Table 2.1]

We can generate a tree which is equivalent to the dictionary list:

   ROOT
    |
    +-A(0)-+-A(4)-+-B(5)
    |      |      +-C(8)
    |      +-B(9)
    |
    +-B(1)---A(6)---B(7)
    |
    +-C(2)---D(10)
    |
    +-D(3)

   [Table 2.2]

The numbers in brackets are the indices of the phrases (see Table
2.1). Because all the branches A, B, C, D are built directly at the
root, for any given prefix w we can index straight into the next
level.

For example, suppose we encounter the phrase "ABC". In the old,
list-based algorithm we have to search for "AB" first, and the
program performs 6 compare operations (against entries 4-9) before
it finds "AB" in the list. Then the program continues to search for
"ABC" (according to the rule, "ABC" can only come after "AB", so we
can simply continue the search), which takes one more, so 7 compares
altogether.

If we use the tree structure instead, we first enter branch A at the
root without any compare. Then it takes just two compares to reach
branch B, and because there are no sub-branches under B we now know
that "ABC" is a new phrase -- after only 2 compare operations!

As illustrated in Table 2.2, we have to store the index number in
every node, so that when we reach our final target we can simply
read the index from the node and output it to the destination
stream. We also have to tell the program how many branches exist
under each node. Because no phrases are ever built upon codes 256
and 257, these two special freaks will NOT be included in the tree;
they are handled by the program directly.

A simple data structure description is given below:

   typedef struct LZW_tree {
       short number_of_branches;  /* short, not char: the root alone
                                     can have up to 256 branches */
       int index;
       struct LZW_tree **Branches;
   } A_Tree;

   /* For the root, the index field is of no use,
      so set it to -1 if you like. */
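To make the tree walk concrete, here is a brief sketch of how
searching and growing one level might look with the structure above.
The symbol field and the helper names (step, grow) are additions of
this sketch -- the document leaves open how the branches are
labelled -- and allocation error checks are omitted for brevity.

   #include <stdlib.h>

   typedef struct LZW_tree {
       unsigned char symbol;          /* byte labelling this branch
                                         (this sketch's addition) */
       short number_of_branches;
       int index;
       struct LZW_tree **Branches;
   } A_Tree;

   /* Follow character c one level down; NULL means "wK is new". */
   static A_Tree *step(const A_Tree *node, unsigned char c)
   {
       for (int i = 0; i < node->number_of_branches; i++)
           if (node->Branches[i]->symbol == c)  /* scan one level */
               return node->Branches[i];
       return NULL;
   }

   /* Attach a new branch for character c with dictionary index idx. */
   static A_Tree *grow(A_Tree *node, unsigned char c, int idx)
   {
       A_Tree *child = calloc(1, sizeof *child);
       child->symbol = c;
       child->index  = idx;
       node->Branches = realloc(node->Branches,
           (node->number_of_branches + 1) * sizeof *node->Branches);
       node->Branches[node->number_of_branches++] = child;
       return child;
   }

In the encoding loop the test "does wK exist?" then becomes
next = step(w_node, K): a non-NULL result replaces w = wK by a single
pointer move, while NULL means wK is new, so we output w_node->index,
call grow(w_node, K, new_code), and restart from the root. At the
root itself, where every single-byte branch exists, step() can be
replaced by direct array indexing (provided the root's branches are
created in byte order), which is the "enter branch A without any
compare" behaviour described above.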
With this tree-structured dictionary we can reduce the searching
time significantly. You should have noticed, however, that even with
the dictionary organized differently we still cannot avoid linear
list searching: at every level of the tree we still search linearly.
In the worst case this could produce as many compare operations as
the plain linear list! So the next idea that popped into my mind
was: if we search the branches with the higher occurrence counts
first, we can optimize the speed once again.

--------------------------------------
[3] Occurrence Based Sorting (O.B.S.)
--------------------------------------

In the final section of the Introduction part we mentioned a
high-frequency-survive strategy: a phrase in the dictionary which
has a high occurrence count will stay when the dictionary is fully
occupied, while items with poor occurrence counts will be killed.
(This reminds me of genetic algorithms ;-)

OBS is somewhat similar to that strategy, but is used in a totally
different way. As the name implies, it is an algorithm for sorting
with respect to occurrence. Here is a basic example of OBS.

Given an alphabet set I = {'A','B','C','D'} and the I-based string
"ABAADDB", the normal way to sort the alphabet by occurrence is to
count the occurrences first and then apply a sorting algorithm such
as bubble sort or binary sorting. OBS is different: it sorts WHILE
counting the occurrences. The basic algorithm can be expressed as:

   loop
       read a character K
       find the position n of K in the occurrence list
       increase position n's occurrence o[n] by 1
       if o[n] > o[n-1] then swap entries n and n-1
   until end-of-information

Source string "ABAADDB":

   Character read   Occurrence list
   (initial)        A(0) B(0) C(0) D(0)
   A                A(1) B(0) C(0) D(0)
   B                A(1) B(1) C(0) D(0)
   A                A(2) B(1) C(0) D(0)
   A                A(3) B(1) C(0) D(0)
   D                A(3) B(1) D(1) C(0)
   D                A(3) D(2) B(1) C(0)
   B                A(3) D(2) B(2) C(0)

   [TABLE 3.1]

See, sorted. But I must say that I cannot guarantee it will always
produce an exactly sorted list; on some occasions it can only
produce an approximately sorted one. You might have noticed that an
item in the list only gets compared when it occurs in the input
stream, and then it is compared only once! So the algorithm
sometimes cannot work out an exactly sorted result simply because an
item did not get enough chances to reach its deserved position. Bear
in mind, though, that if a symbol has a higher occurrence count it
will definitely have more chances to be compared, while those with
zero occurrence only ever have the "chance" to be moved down.
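As a self-contained illustration, here is a small sketch in C that
reproduces TABLE 3.1 (the names are the sketch's own). Note how each
occurrence moves an item at most one position towards the front of
the list, which is exactly why the result is only guaranteed to be
approximately sorted.

   #include <stdio.h>

   struct entry { char sym; long count; };

   static void obs_step(struct entry *list, int len, char k)
   {
       int n = 0;
       while (n < len && list[n].sym != k)   /* find K in the list */
           n++;
       if (n == len) return;                 /* K not in alphabet */
       list[n].count++;                      /* o[n] = o[n] + 1 */
       if (n > 0 && list[n].count > list[n - 1].count) {
           struct entry tmp = list[n];       /* one swap forwards */
           list[n] = list[n - 1];
           list[n - 1] = tmp;
       }
   }

   int main(void)
   {
       struct entry list[] = { {'A',0}, {'B',0}, {'C',0}, {'D',0} };
       const char *src = "ABAADDB";

       for (const char *p = src; *p; p++) {
           obs_step(list, 4, *p);
           printf("%c   ", *p);
           for (int i = 0; i < 4; i++)
               printf("%c(%ld) ", list[i].sym, list[i].count);
           printf("\n");
       }
       return 0;
   }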
The advantage of OBS is that it is dynamic: it automatically
responds to changes in the occurrence distribution, although that
response takes a fair amount of time. In compressing natural
language text, the occurrence frequency of each symbol is likely to
stay steady after a long stream has been processed, so OBS will sort
the list practically perfectly.

Another aspect that makes OBS suitable for LZW is the tree structure
itself. You might notice that under any one branch there are not
many sub-branches. That is very good news for the OBS algorithm,
because the distance an item has to travel to reach its deserved
place is short.

We can still apply the high-frequency-survive strategy together with
OBS, and even more easily. Because of the nature of the LZW
dictionary, a branch's occurrence count equals the sum of its
sub-branches' occurrence counts, since whenever a branch is reached,
something new is about to be added somewhere under it.

          +-B(3)
   +-A(4)-+           (Something like that...)
          +-A(1)

You may consider the occurrence count of a branch as the WEIGHT of
the node. When the number of dictionary entries reaches the maximum
limit, we may simply cut off the light branches instead of deciding
the fate of every single item. (But don't cut off the root
branches!)

The tree organization and OBS work together perfectly. The final
thing we have to do is to upgrade the tree structure by adding the
occurrence variable to it:

   typedef struct LZW_tree {
       short number_of_branches;
       int index;
       long Weight;               /* occurrence count of this node */
       struct LZW_tree **Branches;
   } A_Tree;

---------------
[4] Conclusion
---------------

We have finally come to this part! Together we have taken a brief
look at the standard (and non-standard) LZW scheme, and I have
explained two techniques I developed for LZW speed optimization.

At the time of writing this document I am in my final year in
college before going to university. Study and revision are
intensive, so I have not written an optimized LZW compressor based
upon the tree structure and OBS. (That is perhaps part of the reason
I am writing this -- hoping someone will do the job for me!) If you
are interested in these algorithms and happen to implement them, or
if you have already developed similar ideas, please do contact me. I
am also eager to hear from you about spelling mistakes and further
developments.

Another idea I have now is: why should we sort EVERY branch? For
branches with just 2 or 3 sub-branches we don't have to sort at all
-- maybe just keep adding up the occurrence counts. That is a good
topic for us to discuss.

I do apologize for my English, though I have been studying in
England for almost five months. Maybe it is something I should
improve at university.

---------------
[5] References
---------------

1. Blackstock, Steve. "LZW and GIF Explained."
   http://www.cs.usyd.edu.au/~loki/cs2csys/gif-info/explain-lzw.txt

2. Bourgin, David. "Introduction to the losslessy compression
   schemes - Description of the source files and the methods."
   12-10-1995.

3. Gutmann, Peter. "Introduction to data compression."
   http://www.faqs.org/faqs/compression-faq/part2/section-1.html
   10-29-1999.

4. Yao, Zhen. "Principal of Data Compression" (in Chinese).
   12-23-1999.

-----------------
End Of Document
-----------------