Date: Thu, 20 Jun 1996 19:00:24 +0200 (MET DST)
To: rusin@math.niu.edu (Dave Rusin)
From: [Permission pending]
Subject: Re: Compression and mathematics

Hello Dave,

>The main idea is just to notice that some letters or letter sequences
>(e.g. 'the') occur more frequently than others (e.g. 'q'), so it makes
>sense to store them with a non-ASCII convention that uses fewer bits
>to store the common strings and more bits for the uncommon ones. For
>example, you can stick to 1-letter sequences only, but arrange the
>letters in order of frequency. Take enough letters from the top of the
>list to total a frequency of about 1/2; assign them codes which begin
>with the bit '0' and assign the others codes which begin with '1'.
>Divide up each half likewise to decide on the second bit, and so on.
>In this way, each letter gets a unique string, but the common letters
>get shorter strings, and so the total length of the stored sequence is
>less than using a 7-bit code for every letter.

Thanks for replying. I do have a question, though. (I assume I
understand the above text correctly; if that's not the case, please
tell me!)

As you assign more letters to one bit (like 'a,e,i,o,u' for first
bit=0), how do you know *which* character you mean?

Or if one character can be represented by 2 bits and another by 5
bits, how does the decompression program know when a new character
starts?

Thanks for the help!

[Permission pending]

==============================================================================

Date: Thu, 20 Jun 96 12:56:00 CDT
From: rusin (Dave Rusin)
To: [Permission pending]
Subject: Re: Compression and mathematics

>As you assign more letters to one bit (like 'a,e,i,o,u' for first
>bit=0), how do you know *which* character you mean?
>
>Or if one character can be represented by 2 bits and another by 5
>bits, how does the decompression program know when a new character
>starts?

Right, you have to make sure that the bit strings used for the
encoding are not only distinct but are never prefixes of one another
(such a set of codes is called a prefix code). But this is easily done
in practice.

I don't have the real data to hand, so let's make up some numbers.
Suppose we have a 16-letter alphabet whose letters appear in a text
with the following frequencies:

   a .18    e .09    i .04    m .02
   b .13    f .07    j .03    n .02
   c .12    g .06    k .03    o .01
   d .11    h .05    l .03    p .01

I chose 16 letters to give the ASCII-type coding the best chance of
success: you'd use all 4-bit codes with no wasted bits. (Normal ASCII
is pretty inefficient, using 7 or 8 bits for just 26 distinct letters,
or about 100 distinct characters.)

Let's compare that to the proposed alternative. We'd give a, b, c, and
d codes 0*, and the rest 1*. Iterating the construction within the
first set suggests

   a -> 000
   b -> 001
   c -> 010
   d -> 011

In the second set, I guess e, f, g, h get second digit 0, the rest 1.
Naturally I can just use

   e -> 1000
   f -> 1001
   g -> 1010
   h -> 1011

for the first bunch, but for the rest I notice i, j, k take up about
half the remaining distribution, so I give them a 0 next and the rest
a 1. That is, I use

   i -> 1100
   j -> 11010
   k -> 11011

and then split off l and m from the rest:

   l -> 11100
   m -> 11101

followed by

   n -> 11110
   o -> 111110
   p -> 111111
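In modern terms, the top-down splitting just carried out is
Shannon-Fano coding. Here is a minimal Python sketch of the rule (an
illustration added to this page, not from the original message): sort
by frequency, find the most evenly balanced split, hand out a 0 and a
1, and recurse. A strictly "most balanced" split happens to choose
slightly different groups than the hand-made choices above (it puts
only e, f, g on the 0-branch of the second group), but either way no
codeword ends up a prefix of another.

    def build_code(symbols, prefix=""):
        """symbols: a list of (letter, frequency), most frequent first."""
        if len(symbols) == 1:
            return {symbols[0][0]: prefix or "0"}
        # Find the split that makes the two groups' total frequencies
        # as close to equal as possible.
        total = sum(f for _, f in symbols)
        running, best_k, best_gap = 0.0, 1, total
        for k in range(1, len(symbols)):
            running += symbols[k - 1][1]
            gap = abs(running - (total - running))
            if gap < best_gap:
                best_k, best_gap = k, gap
        # The first group's codes get a '0' next, the rest a '1'.
        code = build_code(symbols[:best_k], prefix + "0")
        code.update(build_code(symbols[best_k:], prefix + "1"))
        return code

    freqs = [("a", .18), ("b", .13), ("c", .12), ("d", .11),
             ("e", .09), ("f", .07), ("g", .06), ("h", .05),
             ("i", .04), ("j", .03), ("k", .03), ("l", .03),
             ("m", .02), ("n", .02), ("o", .01), ("p", .01)]
    code = build_code(freqs)    # e.g. code["a"] == "000"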
Now let me try a random (honest!) string and try to decode it:

   010010111010100101101010101001011011011100101001000

The first bit is 0, so the letter is one of a, b, c, d; the next bit,
1, narrows it down to c or d; and the third bit, 0, says it can only
be c. Then I look at the remaining string 010111... and see another c.
Peeling that off as well, we continue:

   010 010 11101 010 010 11010 1010 1001 011 011 011 1001 010 010 00
    c   c    m    c   c    j    g    f   d   d   d   f    c   c

The last two digits don't correspond to any letter's code.

Incidentally, consider the expected length of an encoded message. Here
again are the frequencies, each now followed by the bit length of the
code used for that letter:

   a .18 3    e .09 4    i .04 4    m .02 5
   b .13 3    f .07 4    j .03 5    n .02 5
   c .12 3    g .06 4    k .03 5    o .01 6
   d .11 3    h .05 4    l .03 5    p .01 6

The expected number of bits used per character is thus

   3(.54) + 4(.31) + 5(.13) + 6(.02) = 3.63

-- better than the 4.0 you would need for a fixed 4-bit code. In
natural languages the letters tend not to be used equally often, which
permits this kind of saving. When you look at two-or-more-letter
combinations, the savings are even better, because the distributions
are even more skewed once you already know one letter. (Many letters
are used with similar frequencies overall, but when you have just had
a "b" you're _much_ more likely to get a vowel than a consonant,
except for a few consonants of medium likelihood [b, l, r, s].) By
looking at longer strings in this way one can achieve the compression
rates you see quoted in comparisons (circa 50%).

Various other tricks can be used too -- e.g. adjusting the codes
mid-file, since certain words might be used heavily for a few
paragraphs and then dropped. Smart programs also try to recognize the
file type and adjust the compression scheme accordingly (for text,
executables, images, sound, etc.)

Of course, if your file is just a million random bits, then you can't
really compress it much. That's a fundamental logical limitation: no
lossless compression scheme can hope to have a compression ratio
better than 1:1 on all strings of a given length, since there is no
one-to-one function from a set of cardinality 2^n to a set of
cardinality 2^(n-1) or smaller. Lossy compression schemes (e.g. JPEG
image compression) are another matter, of course.

Practical programs must also address the ability to limit data loss in
cases of transmission failure (notice that a single missing bit, above,
would ruin the rest of the decoding). Other topics include encoding
into restricted character sets (UUENCODE-style), encryption (possible
e.g. with PKZIP) and of course decryption, program efficiency and
techniques (e.g. self-extracting programs), and natural language
analysis.

dave
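As a check on the walk-through and on the 3.63-bit average, here is a
matching Python sketch (again an added illustration, not from the
original messages) that decodes with the exact table constructed in
Dave's example:

    CODE = {"a": "000",   "b": "001",   "c": "010",    "d": "011",
            "e": "1000",  "f": "1001",  "g": "1010",   "h": "1011",
            "i": "1100",  "j": "11010", "k": "11011",  "l": "11100",
            "m": "11101", "n": "11110", "o": "111110", "p": "111111"}

    def decode(bits):
        """Read bits left to right, emitting a letter as soon as the
        buffer matches a complete codeword.  This is unambiguous
        precisely because no codeword is a prefix of another."""
        inverse = {v: k for k, v in CODE.items()}
        letters, buf = [], ""
        for bit in bits:
            buf += bit
            if buf in inverse:
                letters.append(inverse[buf])
                buf = ""
        return letters, buf   # buf holds any leftover, undecodable bits

    msg = "010010111010100101101010101001011011011100101001000"
    print(decode(msg))
    # -> c c m c c j g f d d d f c c, with '00' left over, as above

    freq = {"a": .18, "b": .13, "c": .12, "d": .11, "e": .09,
            "f": .07, "g": .06, "h": .05, "i": .04, "j": .03,
            "k": .03, "l": .03, "m": .02, "n": .02, "o": .01, "p": .01}
    print(sum(freq[s] * len(CODE[s]) for s in CODE))   # about 3.63

The bit-by-bit buffer match works only because the code is prefix-free;
with codes like {0, 01} the decoder could not tell where one letter
ends and the next begins without looking ahead.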