From: "Earl F. Glynn"
Newsgroups: sci.math,sci.math.num-analysis
Subject: Re: Frequency of English Alphabet
Date: 4 Jan 1999 06:11:09 GMT

Wilson Figueroa wrote in message <368F9323.51BB@aznet.net>...
>I was reading some books on cryptography and read about statistical
>attacks on ciphers.
>
>I was wondering if anyone has a complete list for all 26 letters of the
>English alphabet and how often these arise in normal writing (for
>example, the letter E is the most often used letter in the English
>language and one source said it has a 0.18 probability of occurring).
>What is the complete list for all 26 letters?

Here's the "One-Gram Probability Distribution" from Alan G. Konheim's
"Cryptography -- A Primer," John Wiley, 1981, p. 16:

A  0.0856
B  0.0139
C  0.0279
D  0.0378
E  0.1304
F  0.0289
G  0.0199
H  0.0528
I  0.0627
J  0.0013
K  0.0042
L  0.0339
M  0.0249
N  0.0707
O  0.0797
P  0.0199
Q  0.0012
R  0.0677
S  0.0607
T  0.1045
U  0.0249
V  0.0092
W  0.0149
X  0.0017
Y  0.0199
Z  0.0008

The Two-Gram Probability Distribution (p. 16) is also very interesting.

efg
_________________________________
efg's Computer Lab:  www.efg2.com/lab
efg's Technical Book Store:  www.efg2.com/lab/TechBooks

Earl F. Glynn     E-Mail:  EarlGlynn@att.net
Overland Park, KS  USA

==============================================================================

Addenda [djr]

Here are the letters, now listed according to the frequency of usage in
English. Adjoined is their encoding in International Morse Code.

0.1304  E  .
0.1045  T  -
0.0856  A  .-
0.0797  O  ---
0.0707  N  -.
0.0677  R  .-.
0.0627  I  ..
0.0607  S  ...
0.0528  H  ....
0.0378  D  -..
0.0339  L  .-..
0.0289  F  ..-.
0.0279  C  -.-.
0.0249  M  --
0.0249  U  ..-
0.0199  G  --.
0.0199  Y  -.--
0.0199  P  .--.
0.0149  W  .--
0.0139  B  -...
0.0092  V  ...-
0.0042  K  -.-
0.0017  X  -..-
0.0013  J  .---
0.0012  Q  --.-
0.0008  Z  --..

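================

A minimal Python sketch of how the frequency-sorted listing above can be reproduced from Konheim's table. The names KONHEIM and by_frequency are illustrative and not part of the original post. Letters with equal probability (M/U and G/Y/P) come out in alphabetical order here, so their relative order may differ from the listing above.

```python
# Konheim's one-gram probabilities, as quoted in the post above.
KONHEIM = {
    "A": 0.0856, "B": 0.0139, "C": 0.0279, "D": 0.0378, "E": 0.1304,
    "F": 0.0289, "G": 0.0199, "H": 0.0528, "I": 0.0627, "J": 0.0013,
    "K": 0.0042, "L": 0.0339, "M": 0.0249, "N": 0.0707, "O": 0.0797,
    "P": 0.0199, "Q": 0.0012, "R": 0.0677, "S": 0.0607, "T": 0.1045,
    "U": 0.0249, "V": 0.0092, "W": 0.0149, "X": 0.0017, "Y": 0.0199,
    "Z": 0.0008,
}


def by_frequency(table):
    """Return the letters ordered from most to least frequent.

    sorted() is stable, so tied letters keep the table's own
    (alphabetical) order.
    """
    return sorted(table, key=lambda letter: -table[letter])


if __name__ == "__main__":
    for letter in by_frequency(KONHEIM):
        print(f"{KONHEIM[letter]:.4f}  {letter}")
```

The same table also gives the ratio quoted further down: KONHEIM["O"] / KONHEIM["K"] is roughly 19.
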
================

The Boy Scout handbook claims "the frequency in which we use [the letter]
in the English language" is:
     ETAOINS HRDLUCM PFWVYB GJQKXZ
and provides the Morse code equivalents shown. I don't know the specific
connection intended by the inventors of the code, but the pattern is
clear. We can sort the letters this way: L1 < L2 if
   (codelength1) < (codelength2), OR
   (codelength1) = (codelength2) and (# dots)_1 > (# dots)_2, OR
   (codelength1) = (codelength2) and (# dots)_1 = (# dots)_2, and
         code_1 < code_2 lexicographically (where "." < "-");
the first two rules attempt to encapsulate the idea that the letters near
the top are faster to transmit.

Here then is the ordering. It loosely follows the frequency table. It is
difficult to find a rationale which allots "O" the code "---" and "K" the
"nicer" code "-.-" when O is used 19 times as often as K !
(--- -.-? :-) )

0.1304  E  .
0.1045  T  -
0.0627  I  ..
0.0856  A  .-
0.0707  N  -.
0.0249  M  --
0.0607  S  ...
0.0249  U  ..-
0.0677  R  .-.
0.0378  D  -..
0.0149  W  .--
0.0042  K  -.-
0.0199  G  --.
0.0797  O  ---
0.0528  H  ....
0.0092  V  ...-
0.0289  F  ..-.
0.0339  L  .-..
0.0139  B  -...
0.0199  P  .--.     Note: skipped ..-- and .-.-
0.0017  X  -..-
0.0279  C  -.-.
0.0008  Z  --..
0.0013  J  .---
0.0199  Y  -.--
0.0012  Q  --.-

The last two possible codes (in this ordering), namely "---." and "----",
are not used. Longer codes also exist for digits (5 symbols each) and for
punctuation (typically 5 or 6 symbols). I don't think there is a way to
signify case.

================

Braille anyone?

Standard English Braille Cell List
taken from http://dots.physics.orst.edu/gs_bs_seb.html

[dot 1] a
[dot 1 2] b
[dot 1 4] c
[dot 1 4 5] d
[dot 1 5] e
[dot 1 2 4] f
[dot 1 2 4 5] g
[dot 1 2 5] h
[dot 2 4] i
[dot 2 4 5] j
[dot 1 3] k
[dot 1 2 3] l
[dot 1 3 4] m
[dot 1 3 4 5] n
[dot 1 3 5] o
[dot 1 2 3 4] p
[dot 1 2 3 4 5] q
[dot 1 2 3 5] r
[dot 2 3 4] s
[dot 2 3 4 5] t
[dot 1 3 6] u
[dot 1 2 3 6] v
[dot 2 4 5 6] w
[dot 1 3 4 6] x
[dot 1 3 4 5 6] y
[dot 1 3 5 6] z

* Punctuation marks:
[dot 2] ,
[dot 2 3] ;
[dot 2 5] :
[dot 2 5 6] .
[dot 2 3 5] !
[dot 2 3 6] ? (at end of word)
[dot 3] '
[dot 3 6] -
[dot 2 3 6] open double quote, when at beginning of word
[dot 3 5 6] close double quote
[dot 2 3 5 6] parenthesis, ( when on left, ) when on right

* Full-word signs:
[dot 1 2 3 4 6] and
[dot 1 2 3 4 5 6] for
[dot 1 2 3 5 6] of
[dot 2 3 4 6] the
[dot 2 3 4 5 6] with

* Letter combinations:
[dot 1 6] CH
[dot 1 2 6] GH
[dot 1 4 6] SH
[dot 1 4 5 6] TH
[dot 1 5 6] WH
[dot 1 2 4 6] ED
[dot 1 2 4 5 6] ER
[dot 1 2 5 6] OU
[dot 2 4 6] OW
[dot 2 6] EN
[dot 3 5] IN
[dot 3 4] ST
[dot 3 4 5] AR
[dot 3 4 6] ING

* Indicators, always appear in combination with other cells:
[dot 3 4 5 6] number indicator (also used as internal contraction for "ble")
[dot 6] capitalization and internal contraction indicator
[dot 5 6] letter and internal contraction indicator
[dot 5] second internal contraction indicator
[dot 4 6] third internal contraction indicator
[dot 4 5 6] general contraction indicator
[dot 4 5] second general contraction indicator
[dot 4] accented letter indicator (follows letter)

A slight correlation with frequency-of-use:

[dot 1] a
[dot 1 2] b
[dot 1 3] k
[dot 1 4] c
[dot 1 5] e
[dot 2 4] i
[dot 1 2 3] l
[dot 1 2 4] f
[dot 1 2 5] h
[dot 1 3 4] m
[dot 1 3 5] o
[dot 1 3 6] u
[dot 1 4 5] d
[dot 2 3 4] s
[dot 2 4 5] j
[dot 1 2 3 4] p
[dot 1 2 3 5] r
[dot 1 2 3 6] v
[dot 1 2 4 5] g
[dot 1 3 4 5] n
[dot 1 3 4 6] x
[dot 1 3 5 6] z
[dot 2 3 4 5] t
[dot 2 4 5 6] w
[dot 1 3 4 5 6] y
[dot 1 2 3 4 5] q

================

Oh, what the heck. One more comment on letter frequency: a mnemonic,
heard from d.j.e.nunn@durham.ac.uk (Douglas Nunn):

   Elephants' toenails are orange, not red, I suspect.
   Helen drives Lorna's Ford Cortina.
   My uncle George's yellow Peugeot went because Vicky kept
   x-raying Jonathan's queer zebra.
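
================

Two of the orderings above can be checked mechanically. What follows is a minimal Python sketch, not anything from the original posts: the names MORSE, morse_key, morse_order, MNEMONIC and initials are illustrative. The first part applies the three-part Morse sort rule (shorter codes first, then more dots first, then lexicographic with "." < "-"); the second takes the first letter of each word of the mnemonic, which should spell out the frequency-sorted listing.

```python
# International Morse Code for the 26 letters, as listed above.
MORSE = {
    "A": ".-",   "B": "-...", "C": "-.-.", "D": "-..",  "E": ".",
    "F": "..-.", "G": "--.",  "H": "....", "I": "..",   "J": ".---",
    "K": "-.-",  "L": ".-..", "M": "--",   "N": "-.",   "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.",  "S": "...",  "T": "-",
    "U": "..-",  "V": "...-", "W": ".--",  "X": "-..-", "Y": "-.--",
    "Z": "--..",
}


def morse_key(code):
    """Sort key for the rule in the text.

    1. shorter codes first;
    2. among equal lengths, more dots first;
    3. remaining ties lexicographic with "." < "-".
    In ASCII "-" sorts before ".", so the symbols are mapped
    to "0" and "1" to get the intended order.
    """
    ordered = code.replace(".", "0").replace("-", "1")
    return (len(code), -code.count("."), ordered)


def morse_order(morse=MORSE):
    """Return the letters as one string, sorted by morse_key."""
    return "".join(sorted(morse, key=lambda letter: morse_key(morse[letter])))


MNEMONIC = (
    "Elephants' toenails are orange, not red, I suspect. "
    "Helen drives Lorna's Ford Cortina. "
    "My uncle George's yellow Peugeot went because Vicky kept "
    "x-raying Jonathan's queer zebra."
)


def initials(sentence):
    """Return the upper-cased first character of each word."""
    return "".join(word[0].upper() for word in sentence.split())


if __name__ == "__main__":
    print(morse_order())       # Morse ordering, E first
    print(initials(MNEMONIC))  # frequency ordering, E first
```
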