From: Nick Yao Zhen
Subject: An Optimized Implementation of LZW Algorithm
Date: Tue, 15 Feb 2000 18:27:45 +0000
Newsgroups: comp.compression,sci.math,sci.physics

+================================================+
|  An Optimized Implementation of LZW Algorithm  |
+------------------------------------------------+
| Author : Nick Yao Zhen                         |
| Date   : 02/14/2000 (mm/dd/yyyy)               |
+================================================+

-------Notes---------

This document gives a brief overview of the LZW algorithm, so that
readers can fully understand the ENCODING process, and then discusses
a method developed independently by the author which can improve the
speed of the encoding algorithm.

The method the author developed may well have been invented by others
already. The author fully appreciates the work done by those
inventors.

FREELY DISTRIBUTABLE AS LONG AS REPRODUCED COMPLETELY.
Copyright 2000 Nick Yao Zhen
.....................................................................

------------------
[1] INTRODUCTION
------------------

1.1 The LZW Algorithm

The LZW scheme is named after its inventors: Abraham Lempel, Jacob
Ziv and Terry Welch. It works by entering phrases into a dictionary;
when a repeat occurrence of a particular phrase is found, it outputs
the dictionary index instead of the phrase itself.

LZW is patented in the USA (patent number 4,558,302). The patent is
held by Unisys Corporation, and it covers, among other things,
CompuServe's GIF format. You may get a limited licence by writing to
the following address:

   Welch Licencing Department
   Office of the General Counsel
   M/S C1SW19
   Unisys Corporation
   Blue Bell
   Pennsylvania, 19424, U.S.A.

LZW has been the most well-known Lempel-Ziv data compression scheme,
and it was also adopted by the GIF graphics format. Here is a simple
example to illustrate how the algorithm works.

Note: LZW starts with a 4K dictionary whose entries 0-255 refer to
the ASCII character set. Normally it has two more entries: 256
indicates reinitialization of the dictionary and 257 represents the
End-Of-Information signal. Each time a new phrase is encountered,
LZW appends the phrase to the dictionary. (In most cases entries
258-4095 are used for new phrases.)

Source string: /WED/WE/WEE/WEB

   Character Input   Output   Dictionary
   /W                /        258 = /W
   E                 W        259 = WE
   D                 E        260 = ED
   /                 D        261 = D/
   WE                258      262 = /WE
   /                 E        263 = E/
   WEE               262      264 = /WEE
   /W                263      265 = E/W
   EB                259      266 = WEB
   <END>             B
                     257

The LZW encoding algorithm can be described as follows:

   set w = NIL
   loop
       read a character K
       if wK exists in the dictionary
           w = wK
       else
           output the code for w
           add wK to the dictionary
           w = K
   until end of information

You may notice a problem here: how can we represent 256, or any
number above it, within one byte? If we simply used two bytes, then
for two-byte-long phrases the LZW algorithm would not compress at
all! The common solution is that when the number of entries in the
dictionary reaches 2^N, the algorithm automatically uses N+1 bits to
output index numbers; e.g. when it reaches 512, it uses 10 bits.
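To make the loop concrete, here is a minimal sketch in C of the
encoder above, assuming the standard 256-entry initial dictionary and
the plain linear search that the rest of this document sets out to
improve. All names are the sketch's own, phrases are stored as plain
strings, and the bit-packing is only simulated (each code is printed
together with its current width), so treat it as an illustration
rather than a finished encoder.

   #include <stdio.h>
   #include <string.h>

   #define MAX_ENTRIES 4096
   #define FIRST_CODE  258            /* 256 = reinit, 257 = EOI */

   static char dict[MAX_ENTRIES][64]; /* phrases, searched linearly */
   static int  n_entries;

   static int find(const char *s)     /* the one-by-one search */
   {
       for (int i = 0; i < n_entries; i++)
           if (strcmp(dict[i], s) == 0)
               return i;
       return -1;
   }

   static int width(void)             /* N+1 bits once entries hit 2^N */
   {
       int bits = 9;
       long limit = 512;
       while (n_entries >= limit) { bits++; limit <<= 1; }
       return bits;
   }

   static void emit(int code)         /* stand-in for real bit output */
   {
       if (code >= 0 && code < 256)
           printf("'%c' in %d bits\n", code, width());
       else
           printf("%d in %d bits\n", code, width());
   }

   int main(void)
   {
       const char *src = "/WED/WE/WEE/WEB";
       char w[64] = "", wK[64];

       /* Initial dictionary: one entry per byte value.  (Entry 0,
          the NUL character, cannot live in a C string; it is never
          looked up in this demo, so that is harmless here.) */
       for (int i = 0; i < 256; i++) {
           dict[i][0] = (char)i;
           dict[i][1] = '\0';
       }
       n_entries = FIRST_CODE;        /* skip the two special codes */

       for (const char *p = src; *p; p++) {
           snprintf(wK, sizeof wK, "%s%c", w, *p);
           if (find(wK) >= 0) {
               strcpy(w, wK);         /* wK exists: w = wK */
           } else {
               emit(find(w));         /* output the code for w */
               if (n_entries < MAX_ENTRIES)
                   strcpy(dict[n_entries++], wK);  /* add wK */
               w[0] = *p; w[1] = '\0';             /* w = K */
           }
       }
       if (w[0]) emit(find(w));       /* flush the last phrase */
       emit(257);                     /* End-Of-Information */
       return 0;
   }

Running it on the source string above prints exactly the Output
column of the table. Note how every probe of the dictionary goes
through find(): that linear search is the cost the following
sections attack.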
1.2 Dictionary

Because LZW is a dictionary-based data compression scheme, the
dictionary operation is essential to the performance of the
algorithm. "Performance" here normally refers to:

   # Processing speed
   # Compression ratio

The first thing you may wonder about is why the standard LZW scheme
SHOULD have an initial dictionary of 256 or 258 entries. Assume the
algorithm begins with NO initial dictionary and is given a source
string of "ABABAB...". The output result is astonishing.

Source string: ABABABABAB.....

   Character Input   Output   Dictionary
   A                 0        0 = A
   B                 1        1 = AB
   A                 2        2 = BA
   BA                3        3 = ABA
   BAB               4        4 = ABAB
   AB                5        5 = BAB
   ABA               6        6 = BABA
   BABA              7        7 = ABABA
   ...

While the number of entries is in the range 0-1 it takes only 1 bit
to encode an index; in the range 2-3 it takes 2 bits; in the range
4-7, 3 bits. So altogether "ABABABABABABABABA", a 17-byte-long
string, can be encoded into 18 bits. Obviously the algorithm with no
initial dictionary can achieve a better compression ratio.

Another disadvantage of the initial dictionary is that, in
compressing text files, almost half of the dictionary entries will
never be used. As a result, you are actually wasting one unnecessary
bit every time you output an index number.

There must be some reason for the standard LZW scheme to have that
initial dictionary, and that reason is SPEED. Every time a new byte
is read, LZW has to perform a search in the dictionary, so most of
the processing time is consumed by the searching. Without an
ASCII-ordered initial dictionary we would have to search for every
SINGLE byte as well.

1.3 Improving the dictionary operation

The most important thing we can do to accelerate the LZW encoding
procedure is to make the dictionary search as quick as possible. In
David Bourgin's document "Introduction to the losslessy compression
schemes - Description of the source files and the methods", he
raises three possible ways to optimize the dictionary operation:

 * Limit the dictionary to a high number of words.
   A 4K dictionary will enable you to process a stream of
   7,370,880 bytes.

 * Use a dictionary of fewer than 258 entries if possible.
   In processing 16-color pictures without half-byte packing this is
   a good plan. It may also work with text files.

 * Reinitialize the dictionary with high-frequency entries only.
   This is a good strategy, especially for graphics files.

However, in all three cases we cannot avoid a linear list search,
which means we have to examine the phrases ONE BY ONE. The next part
of this document gives a new way to organize the dictionary, which
could give the LZW encoding scheme better speed performance.

--------------------------------
[2] A Tree Organized Dictionary
--------------------------------

Before introducing the algorithm, the reader should have a good
knowledge of data structures. In this algorithm a tree (NOT a binary
tree) is used to represent the whole dictionary.

We mentioned the basic encoding algorithm in the INTRODUCTION part.
(If readers don't mind, I will paste it again.)

   set w = NIL
   loop
       read a character K
       if wK exists in the dictionary
           w = wK
       else
           output the code for w
           add wK to the dictionary
           w = K
   until end of information

You should notice that every item added to the dictionary MUST have
a prefix which is ALREADY in the dictionary (w is the prefix). That
means that if "ABCD" exists in a dictionary, then "A", "AB" and
"ABC" MUST also exist (though not vice versa). So the structure of
"ABCD" is something like:

   A-B-C-D

Assume we have an alphabet set I = {'A', 'B', 'C', 'D'} and a
dictionary of:

   Index   Phrase
   0       A
   1       B
   2       C
   3       D
   4       AA
   5       AAB
   6       BA
   7       BAB
   8       AAC
   9       AB
   10      CD

   [Table 2.1]

We can generate a tree which is equivalent to the dictionary list:

   ROOT
    |
    +-A(0)-+-A(4)-+-B(5)
    |      |      +-C(8)
    |      +-B(9)
    |
    +-B(1)---A(6)---B(7)
    |
    +-C(2)---D(10)
    |
    +-D(3)

   [Table 2.2]

The numbers in brackets are the indices of the phrases (see Table
2.1). Because all the branches A, B, C, D are built directly at the
root, for any given prefix w we can index straight into the next
level.

For example, suppose we encounter the phrase "ABC". In the old,
list-based algorithm we have to search for "AB" first, and the
program performs 6 compare operations (against entries 4-9) before
it finds "AB" in the list. Then the program continues to search for
"ABC" (according to the rule, "ABC" can only come after "AB", so we
can simply continue the search), which takes one more, so 7 compares
altogether.

If we use the tree structure instead, we first enter branch A at the
root without any compare. Then it takes just two compares to reach
branch B, and because there are no sub-branches under B we now know
that "ABC" is a new phrase -- after only 2 compare operations!

As illustrated in Table 2.2, we have to store the index number in
every node, so that when we reach our final target we can simply
read the index from the node and output it to the destination
stream. We also have to tell the program how many branches exist
under each node. Because no phrases are ever built upon codes 256
and 257, these two special freaks will NOT be included in the tree;
they are handled by the program directly.

A simple data structure description is given below:

   typedef struct LZW_tree {
       short number_of_branches;  /* short, not char: the root alone
                                     can have up to 256 branches */
       int index;
       struct LZW_tree **Branches;
   } A_Tree;

   /* For the root, the index field is of no use,
      so set it to -1 if you like. */
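To make the tree walk concrete, here is a brief sketch of how
searching and growing one level might look with the structure above.
The symbol field and the helper names (step, grow) are additions of
this sketch -- the document leaves open how the branches are
labelled -- and allocation error checks are omitted for brevity.

   #include <stdlib.h>

   typedef struct LZW_tree {
       unsigned char symbol;          /* byte labelling this branch
                                         (this sketch's addition) */
       short number_of_branches;
       int index;
       struct LZW_tree **Branches;
   } A_Tree;

   /* Follow character c one level down; NULL means "wK is new". */
   static A_Tree *step(const A_Tree *node, unsigned char c)
   {
       for (int i = 0; i < node->number_of_branches; i++)
           if (node->Branches[i]->symbol == c)  /* scan one level */
               return node->Branches[i];
       return NULL;
   }

   /* Attach a new branch for character c with dictionary index idx. */
   static A_Tree *grow(A_Tree *node, unsigned char c, int idx)
   {
       A_Tree *child = calloc(1, sizeof *child);
       child->symbol = c;
       child->index  = idx;
       node->Branches = realloc(node->Branches,
           (node->number_of_branches + 1) * sizeof *node->Branches);
       node->Branches[node->number_of_branches++] = child;
       return child;
   }

In the encoding loop the test "does wK exist?" then becomes
next = step(w_node, K): a non-NULL result replaces w = wK by a single
pointer move, while NULL means wK is new, so we output w_node->index,
call grow(w_node, K, new_code), and restart from the root. At the
root itself, where every single-byte branch exists, step() can be
replaced by direct array indexing (provided the root's branches are
created in byte order), which is the "enter branch A without any
compare" behaviour described above.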
With this tree-structured dictionary we can reduce the searching
time significantly. You should have noticed, however, that even with
the dictionary organized differently we still cannot avoid linear
list searching: at every level of the tree we still search linearly.
In the worst case this could produce as many compare operations as
the plain linear list! So the next idea that popped into my mind
was: if we search the branches with the higher occurrence counts
first, we can optimize the speed once again.

--------------------------------------
[3] Occurrence Based Sorting (O.B.S.)
--------------------------------------

In the final section of the Introduction part we mentioned a
high-frequency-survive strategy: a phrase in the dictionary which
has a high occurrence count will stay when the dictionary is fully
occupied, while items with poor occurrence counts will be killed.
(This reminds me of genetic algorithms ;-)

OBS is somewhat similar to that strategy, but is used in a totally
different way. As the name implies, it is an algorithm for sorting
with respect to occurrence. Here is a basic example of OBS.

Given an alphabet set I = {'A','B','C','D'} and the I-based string
"ABAADDB", the normal way to sort the alphabet by occurrence is to
count the occurrences first and then apply a sorting algorithm such
as bubble sort or binary sorting. OBS is different: it sorts WHILE
counting the occurrences. The basic algorithm can be expressed as:

   loop
       read a character K
       find the position n of K in the occurrence list
       increase position n's occurrence o[n] by 1
       if o[n] > o[n-1] then swap entries n and n-1
   until end-of-information

Source string "ABAADDB":

   Character read   Occurrence list
   (initial)        A(0) B(0) C(0) D(0)
   A                A(1) B(0) C(0) D(0)
   B                A(1) B(1) C(0) D(0)
   A                A(2) B(1) C(0) D(0)
   A                A(3) B(1) C(0) D(0)
   D                A(3) B(1) D(1) C(0)
   D                A(3) D(2) B(1) C(0)
   B                A(3) D(2) B(2) C(0)

   [TABLE 3.1]

See, sorted. But I must say that I cannot guarantee it will always
produce an exactly sorted list; on some occasions it can only
produce an approximately sorted one. You might have noticed that an
item in the list only gets compared when it occurs in the input
stream, and then it is compared only once! So the algorithm
sometimes cannot work out an exactly sorted result simply because an
item did not get enough chances to reach its deserved position. Bear
in mind, though, that if a symbol has a higher occurrence count it
will definitely have more chances to be compared, while those with
zero occurrence only ever have the "chance" to be moved down.
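As a self-contained illustration, here is a small sketch in C that
reproduces TABLE 3.1 (the names are the sketch's own). Note how each
occurrence moves an item at most one position towards the front of
the list, which is exactly why the result is only guaranteed to be
approximately sorted.

   #include <stdio.h>

   struct entry { char sym; long count; };

   static void obs_step(struct entry *list, int len, char k)
   {
       int n = 0;
       while (n < len && list[n].sym != k)   /* find K in the list */
           n++;
       if (n == len) return;                 /* K not in alphabet */
       list[n].count++;                      /* o[n] = o[n] + 1 */
       if (n > 0 && list[n].count > list[n - 1].count) {
           struct entry tmp = list[n];       /* one swap forwards */
           list[n] = list[n - 1];
           list[n - 1] = tmp;
       }
   }

   int main(void)
   {
       struct entry list[] = { {'A',0}, {'B',0}, {'C',0}, {'D',0} };
       const char *src = "ABAADDB";

       for (const char *p = src; *p; p++) {
           obs_step(list, 4, *p);
           printf("%c   ", *p);
           for (int i = 0; i < 4; i++)
               printf("%c(%ld) ", list[i].sym, list[i].count);
           printf("\n");
       }
       return 0;
   }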
The advantage of OBS is that it is dynamic: it automatically
responds to changes in the occurrence distribution, although that
response takes a fair amount of time. In compressing natural
language text, the occurrence frequency of each symbol is likely to
stay steady after a long stream has been processed, so OBS will sort
the list practically perfectly.

Another aspect that makes OBS suitable for LZW is the tree structure
itself. You might notice that under any one branch there are not
many sub-branches. That is very good news for the OBS algorithm,
because the distance an item has to travel to reach its deserved
place is short.

We can still apply the high-frequency-survive strategy together with
OBS, and even more easily. Because of the nature of the LZW
dictionary, a branch's occurrence count equals the sum of its
sub-branches' occurrence counts, since whenever a branch is reached,
something new is about to be added somewhere under it.

          +-B(3)
   +-A(4)-+           (Something like that...)
          +-A(1)

You may consider the occurrence count of a branch as the WEIGHT of
the node. When the number of dictionary entries reaches the maximum
limit, we may simply cut off the light branches instead of deciding
the fate of every single item. (But don't cut off the root
branches!)

The tree organization and OBS work together perfectly. The final
thing we have to do is to upgrade the tree structure by adding the
occurrence variable to it:

   typedef struct LZW_tree {
       short number_of_branches;
       int index;
       long Weight;               /* occurrence count of this node */
       struct LZW_tree **Branches;
   } A_Tree;

---------------
[4] Conclusion
---------------

We have finally come to this part! Together we have taken a brief
look at the standard (and non-standard) LZW scheme, and I have
explained two techniques I developed for LZW speed optimization.

At the time of writing this document I am in my final year in
college before going to university. Study and revision are
intensive, so I have not written an optimized LZW compressor based
upon the tree structure and OBS. (That is perhaps part of the reason
I am writing this -- hoping someone will do the job for me!) If you
are interested in these algorithms and happen to implement them, or
if you have already developed similar ideas, please do contact me. I
am also eager to hear from you about spelling mistakes and further
developments.

Another idea I have now is: why should we sort EVERY branch? For
branches with just 2 or 3 sub-branches we don't have to sort at all
-- maybe just keep adding up the occurrence counts. That is a good
topic for us to discuss.

I do apologize for my English, though I have been studying in
England for almost five months. Maybe it is something I should
improve at university.

---------------
[5] References
---------------

1. Blackstock, Steve. "LZW and GIF Explained."
   http://www.cs.usyd.edu.au/~loki/cs2csys/gif-info/explain-lzw.txt

2. Bourgin, David. "Introduction to the losslessy compression
   schemes - Description of the source files and the methods."
   12-10-1995.

3. Gutmann, Peter. "Introduction to data compression."
   http://www.faqs.org/faqs/compression-faq/part2/section-1.html
   10-29-1999.

4. Yao, Zhen. "Principal of Data Compression" (in Chinese).
   12-23-1999.

-----------------
End Of Document
-----------------