From: Phil.Norman@genedata.com (Phil Norman)
Subject: Re: Is "LZW" the most efficient theoretically possible compression technique?
Date: 5 Jan 2000 11:43:17 +0100
Newsgroups: comp.compression,sci.math,sci.physics

On Mon, 03 Jan 2000 16:09:07 -0500, DSM wrote:
>
>Is LZW the best which can be done?

No, most definitely not. Due to the way LZW encodes information, it
wastes, on average, half a bit for every code it outputs. This is
because it always outputs an integer number of bits for every code,
and since it's creating these codes on the fly, it cannot output any
codes whose values are higher than that of the code it's creating at
the time. Therefore there are always some values which can be encoded
in the number of bits the encoder is using, but will never actually
be encoded because the relevant dictionary entries do not exist yet.
Since these values are implicitly known to the decompressor, the
ability to output such values is a waste of bits. Bolting on an
adaptive Huffman encoder (or an arithmetic encoder) would help.

>A practical mind would insist "once redundancy
>in a data set has been cut by the LZW (or similar)
>compression algorithm, you can not shorten the
>set without loss of data."

I'd disagree with this. If a further compressor is specifically
designed to work on LZW-encoded data (and therefore knows how the LZW
algorithm works), then it is quite possible for it to compress the
data further, if only by exploiting the fact (mentioned above) that
LZW is inherently redundant in its use of fixed-length output codes
(fixed-length on a short-term basis, that is).

It's helpful to realise that there are several different types of
compression. Two of the basic forms are pattern matching and value
prediction. LZW is an example of pattern matching, while Huffman and
arithmetic encoding are both forms of prediction (they predict that
the values seen most often so far will keep turning up, and so assign
those values shorter codes). It is possible to generate a data stream
which contains no repeated patterns but a very strong difference in
frequency between different characters, just as it is possible to
create a data stream with the opposite characteristics. Generally
speaking, a combination of both of these types of compression yields
better results than either one on its own. LZW is simply a
pattern-matching compressor, whereas zip deflate (for example) uses
both pattern matching (an LZ77 variant) and value prediction
(Huffman).

>However, could an algorithm that searches for extremely
>subtle patterns in the data set triumph over conventional
>(LZW and others) compression techniques?

Generally speaking, taking a compressed file and trying to compress
it again is not going to work. However, writing a better compressor
than LZW (even if this is done simply by adding further stages of
compression to the LZW algorithm, like Huffman or arithmetic coding)
is not hard.

Cheers,
Phil
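
P.S. To put a rough number on that "half a bit" figure, here's a
hypothetical back-of-the-envelope sketch (illustrative code only, not
taken from any real LZW implementation). It assumes a GIF-style
encoder: 256 literal codes plus clear and end-of-data codes, with the
output code width growing from 9 up to 12 bits as the dictionary
fills.

    /* Rough sketch: estimate how many bits a plain LZW encoder wastes
     * by always writing whole, fixed-width codes.  Assumes a GIF-style
     * setup with 258 initial codes and a 12-bit dictionary limit. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double wasted = 0.0;
        long   codes  = 0;
        int    width  = 9;      /* 258 initial codes need 9 bits    */
        long   next   = 258;    /* next dictionary slot to be built */

        /* Pretend the encoder emits one code per new dictionary entry. */
        while (next < 4096) {
            /* Only 'next' distinct code values can legally occur here,
             * yet we still spend 'width' bits on each one; the
             * difference is the overhead described above. */
            wasted += width - log2((double)next);
            codes++;
            next++;
            if (next == (1L << width) && width < 12)
                width++;
        }
        printf("average waste: %.2f bits per code\n", wasted / codes);
        return 0;
    }

Compile with something like 'cc waste.c -lm'; it prints an average in
the region of 0.4-0.5 bits per code, which is where the half-a-bit
estimate comes from.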
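
P.P.S. If you want to see the pattern-matching-plus-prediction
combination without writing either stage yourself, zlib's deflate
does both: LZ77 matching first, then Huffman coding of the resulting
literals, lengths and distances. A minimal example, assuming zlib is
installed (link with -lz); the input string is just made-up test data:

    /* Sketch using zlib's convenience wrapper around deflate. */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        const char   *text   = "the quick brown fox jumps over the lazy dog "
                               "the quick brown fox jumps over the lazy dog";
        unsigned char out[256];
        uLongf        outlen = sizeof(out);

        /* compress() runs LZ77 pattern matching, then Huffman-codes
         * the literal/length/distance stream it produces. */
        if (compress(out, &outlen, (const Bytef *)text,
                     (uLong)strlen(text)) != Z_OK) {
            fprintf(stderr, "compress failed\n");
            return 1;
        }
        printf("%lu bytes in, %lu bytes out\n",
               (unsigned long)strlen(text), (unsigned long)outlen);
        return 0;
    }

The repeated sentence is mopped up by the LZ77 stage, and the skewed
character frequencies by the Huffman stage.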