STATISTICS ABOUT THE COLLECTION
* Note: Data are collected with a number of pedestrian UNIX tools, so that   *
* non-conforming files (for example) will interfere with correct counts;     *
* but different investigations generally corroborate the numbers within 10%. *
As of Tue Jan 12 12:37:57 CST 1999:
Results from hit-counters indicate approximately 38 000 hits from early
1995 through the end of 1998; now, hits come in to the welcome pages
about once every ten minutes. (Currently, the hit counters only
record accessions of the initial welcome/index pages; thus each
visitor will be counted at most once per web session -- and not at all
if they have a direct URL to some other page. This has not uniformly
been the case, and some of the hits are the result of me reviewing the
files!)
==============================================================================
SIZE OF COLLECTION
==============================================================================
(using du; ls -1 | wc; grep ^From\:\ * | wc; wc * | grep total )
directory        files   posts   lines   words    bytes
93_back/            79     185     14K     93K     628K
94/                 71     220     11K     72K     488K
95/                425     878     52K    355K    2412K
96/                285     506     32K    199K    1395K
97/                271     409     28K    181K    1267K
98/                586    1022     65K    415K    2944K
99/                  0       0       0       0       1K
All 9*/:          1717    3220    202K   1315K    9135K
index/             134      --     24K    125K    1184K
collection/         41      --     10K     74K    1048K
images/             80      --      --      --     251K
welcome.html         1      --   (164)      1K      13K
TOTAL PUBLIC:     1973    3220    236K   1515K   11631K
The directories 9* hold the "Selected Topics" files; index/ holds the
index and navigation pages, with the images in a separate directory;
collection/ holds this file and other information about the site.
(There are also private housekeeping and search-tool database directories;
the grand total is 13.7Meg).
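As a rough sketch of how one row of the size table might be produced, the
function below combines the same tools (ls, grep, wc). The name size_row is
hypothetical, and the sketch assumes every entry in the directory is a plain
text file; it is not necessarily the exact procedure used for the table.

```shell
#!/bin/sh
# size_row DIR  (hypothetical helper, not part of the site's tools)
# Prints: directory, files, posts, lines, words, bytes.
size_row() {
    dir=$1
    # One directory entry per line, so the line count is the file count.
    files=$(ls -1 "$dir" | wc -l)
    # A "post" is taken to be any line beginning with "From: ".
    posts=$(cat "$dir"/* 2>/dev/null | grep -c '^From: ')
    # wc on the concatenated files gives lines, words and bytes at once.
    counts=$(cat "$dir"/* 2>/dev/null | wc)
    # Unquoted on purpose: the shell collapses the padding that wc adds.
    echo $dir $files $posts $counts
}
```

Running "size_row 95" from the directory above 95/ would print one row;
an empty directory such as 99/ yields zeros in every column.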
There are segments of the "topics" files which are neither mail nor posts,
and lack a "From:" line; these are missed in the tallies above. (Indeed, some
41 ( grep -c ^From\:\ 9*/* | grep \:0 | wc ) "topics" files seem to have
no author at all; they are computer programs, .tex files, etc.) Some topics
are listed in two index pages; there are 1732 ( grep ^\
The content files constitute about
      1 100 files,
      2 041 items,
    135 218 lines,
    846 287 words,
  5 402 612 bytes.
The index pages contribute
        102 files,
     14 530 lines,
     71 014 words,
    638 918 bytes.
==============================================================================
CHARACTERISTICS OF CONTENT FILES
==============================================================================
Of these content files, we can check which ones do or don't contain certain
kinds of information.
***Did I have any effect on files? 1031 no (so 686 yes, plus the index files,
etc.; one should also add a portion of the files with no "From:" lines).
( grep -c -i rusin 9*/* | grep -c \:0 )
This includes 788 posts and email I wrote, and 415 emails received.
( grep ^From\:\ 9*/* | grep -i rusin | wc )
( grep ^To\:\ 9*/* | grep -i rusin | wc )
***Newsgroups: Only 176 files have no Newsgroups: line (so 1541 do).
These contain (excerpts of) 2591 posts.
( grep -c ^News 9*/* | grep \:0 | wc)
( grep ^News 9*/* | wc )
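The "files with no match" counts above all follow one pattern: grep -c prints
"name:count" for each file, and a second grep counts the lines ending in ":0".
A minimal sketch of that pattern is below; the name count_without is
hypothetical. Anchoring the second pattern as ':0$' is a small precaution so
that a file whose name happens to contain ":0" is not miscounted.

```shell
#!/bin/sh
# count_without PATTERN FILE...  (hypothetical helper)
# Prints the number of FILEs containing no line that matches PATTERN.
count_without() {
    pattern=$1
    shift
    # /dev/null is added so that grep prints "name:count" even when
    # only one file is given; its own line is filtered out again.
    grep -c -e "$pattern" "$@" /dev/null |
        grep -v '^/dev/null:' |
        grep -c ':0$'
}
```

For example, "count_without '^News' 9*/*" would correspond to the count of
files lacking a Newsgroups: line.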
Current traffic on USENET: approximately 150(?) messages per day in sci.math
alone over the last 12 months, and more in subsidiary newsgroups. Thus
this collection represents much less than 1% of the recent postings in the math
newsgroups.
The process of seeking permission from authors revealed an
unduplicated count of about 500 authors by Spring 1996. About a dozen
had unintelligible addresses and mail to maybe 50 more bounced as
undeliverable (host or user unknown). Total count of authors is unknown.
An upper bound is 2816 -- not very different from the 3220 items total!
( grep ^From\: 9*/* | sort | uniq | wc )
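One reason the bound is loose: when grep is given several files it prefixes
each output line with the file name, so identical "From:" lines in different
files stay distinct under sort | uniq. A sketch that drops the prefix by
concatenating first is below (the name unique_authors is hypothetical). It
still treats differently formatted addresses of one person as different
authors, so it is only a somewhat tighter upper bound.

```shell
#!/bin/sh
# unique_authors FILE...  (hypothetical helper)
# Prints the number of distinct "From:" lines across all FILEs.
unique_authors() {
    # cat first, so grep sees one stream and adds no file-name prefix.
    n=$(cat "$@" | grep '^From:' | sort -u | wc -l)
    # Unquoted on purpose: strips any padding that wc adds.
    echo $n
}
```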
Ages of dated "items"
9 from 1990 ( grep ^Date 9*/* | grep -c 90 )
14 from 1991
79 from 1992
187 from 1993
225 from 1994
910 from 1995
528 from 1996
423 from 1997
1096 from 1998
0 from 1999
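A bare "grep -c 90" can also match digits elsewhere in a Date line, for
instance a time zone such as +0900, which could inflate some of these tallies.
The sketch below (count_year is a hypothetical name) requires the two-digit
year, with or without a leading "19", to stand alone as a number. It assumes
years 90 through 99, which cannot be confused with days, hours, minutes or
seconds.

```shell
#!/bin/sh
# count_year YY FILE...  (hypothetical helper)
# Prints the number of "Date:" lines whose date mentions year YY or 19YY.
count_year() {
    yy=$1
    shift
    # The year must not be preceded or followed by another digit, so
    # "+0900" does not count as 90 and "1999" does not count as 90 either.
    cat "$@" | grep '^Date:' |
        grep -c -E "(^|[^0-9])(19)?${yy}([^0-9]|\$)"
}
```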
==============================================================================
STRENGTHS BY SUBJECT AREA
==============================================================================
We can estimate the _number_ of files to be retrieved for each area.
(Note: A few files are mentioned in more than one index page.)
( grep -c \"\\\.\\\.\/9 [0-9]*l )
In some cases it's easier to lump together subareas like this:
( cat 11*html | sort | uniq | grep -c \"\\\.\\\.\/9 )
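The lumped count can be wrapped in a small function and run once per area.
This is only a sketch: the name area_count is hypothetical, and it assumes the
index pages for an area all begin with its two-digit code and end in "html".
Note that it counts distinct lines containing a link of the form "../9...",
so a line holding two links is counted once.

```shell
#!/bin/sh
# area_count AREA  (hypothetical helper; run inside the index directory)
# Prints the number of distinct lines linking to a "../9..." topic file
# in the index pages whose names start with AREA.
area_count() {
    area=$1
    # sort -u removes lines repeated across the subarea pages;
    # an area with no index pages simply yields 0.
    cat "$area"*html 2>/dev/null | sort -u | grep -c '"\.\./9'
}
```

A loop such as "for a in 00 01 03; do echo $a $(area_count $a); done" would
then print one line per area in the layout used here.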
Here are the counts per subject area: (total should be about 1717)
00 10
01 8
03 29
04 14
05 77
06 3
08 0
11 331 i.e., 20% of the files are in number theory
12 55 + 20% in abstract algebra
13 40 + 20% in geometry and topology
14 114 + 20% in a few areas of fairly good coverage
15 52 (combinatorics, logic, computational math)
16 16 + 20% in many areas of poor coverage
17 0 (analysis, applications, statistics)
18 2
19 2
20 71
22 3
26 39
28 20
30 15
31 1
32 3
33 16
34 24
35 5
37 0
39 13
40 23
41 17
42 5
43 3
44 5
45 0
46 11
47 3
49 3
51 112
52 97
53 13
54 54
55 36
57 73
58 6
60 37
62 35
65 64
68 68
70 10
73 1
74 0
76 4
78 6
80 4
81 1
82 3
83 1
85 0
86 6
90 15
91 0
92 8
93 3
94 18
97