Date: Thu, 20 Jun 1996 19:00:24 +0200 (MET DST)
To: rusin@math.niu.edu (Dave Rusin)
From: [Permission pending]
Subject: Re: Compression and mathematics

Hello Dave,

>The main idea is just to notice that some letters or letter sequences
>(e.g. 'the') occur more frequently than others (e.g. 'q'), so it makes
>sense to store them with a non-ASCII convention that uses fewer bits
>to store the common strings and more bits for the uncommon ones. For
>example, you can stick to 1-letter sequences only, but arrange the
>letters in order of frequency. Take enough letters from the top of the
>list to total a frequency of about 1/2; assign them codes which begin
>with the bit '0' and assign the others codes which begin with '1'.
>Divide up each half likewise to decide on the second bit, and so on.
>In this way, each letter gets a unique string, but the common letters
>get shorter strings, and so the total length of the stored sequence is
>less than using a 7-bit code for every letter.

Thanks for replying. I do have a question, though. (I assume I
understand the above text correctly; if that's not the case, please
tell me!)

As you assign more letters to one bit (like 'a,e,i,o,u' for first
bit=0), how do you know *which* character you mean?

Or if one character can be represented by 2 bits and another by 5
bits, how does the decompression program know when a new character
starts?

Thanks for the help!

[Permission pending]

==============================================================================

Date: Thu, 20 Jun 96 12:56:00 CDT
From: rusin (Dave Rusin)
To: [Permission pending]
Subject: Re: Compression and mathematics

>As you assign more letters to one bit (like 'a,e,i,o,u' for first
>bit=0), how do you know *which* character you mean?
>
>Or if one character can be represented by 2 bits and another by 5
>bits, how does the decompression program know when a new character
>starts?

Right, you have to make sure that the bit strings used for the
encoding are not only distinct but are never prefixes of one another
(such a set of codes is called a prefix code). But this is easily done
in practice.

I don't have the real data to hand, so let's make up some numbers.
Suppose we have a 16-letter alphabet whose letters appear in a text
with the following frequencies:

   a .18    e .09    i .04    m .02
   b .13    f .07    j .03    n .02
   c .12    g .06    k .03    o .01
   d .11    h .05    l .03    p .01

I chose 16 letters to give the ASCII-type coding the best chance of
success: you'd use all 4-bit codes with no wasted bits. (Normal ASCII
is pretty inefficient, using 7 or 8 bits for just 26 distinct letters,
or about 100 distinct characters.)

Let's compare that to the proposed alternative. We'd give a, b, c, and
d codes 0*, and the rest 1*. Iterating the construction within the
first set suggests

   a -> 000
   b -> 001
   c -> 010
   d -> 011

In the second set, I guess e, f, g, h get second digit 0, the rest 1.
Naturally I can just use

   e -> 1000
   f -> 1001
   g -> 1010
   h -> 1011

for the first bunch, but for the rest I notice i, j, k take up about
half the remaining distribution, so I give them a 0 next and the rest
a 1. That is, I use

   i -> 1100
   j -> 11010
   k -> 11011

and then split off l and m from the rest:

   l -> 11100
   m -> 11101

followed by

   n -> 11110
   o -> 111110
   p -> 111111
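In modern terms, the top-down splitting just carried out is
Shannon-Fano coding. Here is a minimal Python sketch of the rule (an
illustration added to this page, not from the original message): sort
by frequency, find the most evenly balanced split, hand out a 0 and a
1, and recurse. A strictly "most balanced" split happens to choose
slightly different groups than the hand-made choices above (it puts
only e, f, g on the 0-branch of the second group), but either way no
codeword ends up a prefix of another.

    def build_code(symbols, prefix=""):
        """symbols: a list of (letter, frequency), most frequent first."""
        if len(symbols) == 1:
            return {symbols[0][0]: prefix or "0"}
        # Find the split that makes the two groups' total frequencies
        # as close to equal as possible.
        total = sum(f for _, f in symbols)
        running, best_k, best_gap = 0.0, 1, total
        for k in range(1, len(symbols)):
            running += symbols[k - 1][1]
            gap = abs(running - (total - running))
            if gap < best_gap:
                best_k, best_gap = k, gap
        # The first group's codes get a '0' next, the rest a '1'.
        code = build_code(symbols[:best_k], prefix + "0")
        code.update(build_code(symbols[best_k:], prefix + "1"))
        return code

    freqs = [("a", .18), ("b", .13), ("c", .12), ("d", .11),
             ("e", .09), ("f", .07), ("g", .06), ("h", .05),
             ("i", .04), ("j", .03), ("k", .03), ("l", .03),
             ("m", .02), ("n", .02), ("o", .01), ("p", .01)]
    code = build_code(freqs)    # e.g. code["a"] == "000"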
Now let me try a random (honest!) string and try to decode it:

   010010111010100101101010101001011011011100101001000

The first bit is 0, so the letter is one of a, b, c, d; the next bit,
1, narrows it down to c or d; and the third bit, 0, says it can only
be c. Then I look at the remaining string 010111... and see another c.
Peeling that off as well, we continue:

   010 010 11101 010 010 11010 1010 1001 011 011 011 1001 010 010 00
    c   c    m    c   c    j    g    f   d   d   d   f    c   c

The last two digits don't correspond to any letter's code.

Incidentally, consider the expected length of an encoded message. Here
again are the frequencies, each now followed by the bit length of the
code used for that letter:

   a .18 3    e .09 4    i .04 4    m .02 5
   b .13 3    f .07 4    j .03 5    n .02 5
   c .12 3    g .06 4    k .03 5    o .01 6
   d .11 3    h .05 4    l .03 5    p .01 6

The expected number of bits used per character is thus

   3(.54) + 4(.31) + 5(.13) + 6(.02) = 3.63

-- better than the 4.0 you would need for a fixed 4-bit code. In
natural languages the letters tend not to be used equally often, which
permits this kind of saving. When you look at two-or-more-letter
combinations, the savings are even better, because the distributions
are even more skewed once you already know one letter. (Many letters
are used with similar frequencies overall, but when you have just had
a "b" you're _much_ more likely to get a vowel than a consonant,
except for a few consonants of medium likelihood [b, l, r, s].) By
looking at longer strings in this way one can achieve the compression
rates you see quoted in comparisons (circa 50%).

Various other tricks can be used too -- e.g. adjusting the codes
mid-file, since certain words might be used heavily for a few
paragraphs and then dropped. Smart programs also try to recognize the
file type and adjust the compression scheme accordingly (for text,
executables, images, sound, etc.)

Of course, if your file is just a million random bits, then you can't
really compress it much. That's a fundamental logical limitation: no
lossless compression scheme can hope to have a compression ratio
better than 1:1 on all strings of a given length, since there is no
one-to-one function from a set of cardinality 2^n to a set of
cardinality 2^(n-1) or smaller. Lossy compression schemes (e.g. JPEG
image compression) are another matter, of course.

Practical programs must also address the ability to limit data loss in
cases of transmission failure (notice that a single missing bit, above,
would ruin the rest of the decoding). Other topics include encoding
into restricted character sets (UUENCODE-style), encryption (possible
e.g. with PKZIP) and of course decryption, program efficiency and
techniques (e.g. self-extracting programs), and natural language
analysis.

dave
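As a check on the walk-through and on the 3.63-bit average, here is a
matching Python sketch (again an added illustration, not from the
original messages) that decodes with the exact table constructed in
Dave's example:

    CODE = {"a": "000",   "b": "001",   "c": "010",    "d": "011",
            "e": "1000",  "f": "1001",  "g": "1010",   "h": "1011",
            "i": "1100",  "j": "11010", "k": "11011",  "l": "11100",
            "m": "11101", "n": "11110", "o": "111110", "p": "111111"}

    def decode(bits):
        """Read bits left to right, emitting a letter as soon as the
        buffer matches a complete codeword.  This is unambiguous
        precisely because no codeword is a prefix of another."""
        inverse = {v: k for k, v in CODE.items()}
        letters, buf = [], ""
        for bit in bits:
            buf += bit
            if buf in inverse:
                letters.append(inverse[buf])
                buf = ""
        return letters, buf   # buf holds any leftover, undecodable bits

    msg = "010010111010100101101010101001011011011100101001000"
    print(decode(msg))
    # -> c c m c c j g f d d d f c c, with '00' left over, as above

    freq = {"a": .18, "b": .13, "c": .12, "d": .11, "e": .09,
            "f": .07, "g": .06, "h": .05, "i": .04, "j": .03,
            "k": .03, "l": .03, "m": .02, "n": .02, "o": .01, "p": .01}
    print(sum(freq[s] * len(CODE[s]) for s in CODE))   # about 3.63

The bit-by-bit buffer match works only because the code is prefix-free;
with codes like {0, 01} the decoder could not tell where one letter
ends and the next begins without looking ahead.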