A note on /homes/ddailey/public_html/full.txt
Abstract: The data provided through the FRELI project contain lists of verified English words, together with frequency data and part-of-speech information. This note makes some comparisons between this list and the others available to us.
Primary information on the file is described at http://www.nkuitse.com/freli/. The file was created by Paul Huffman as a part of the FRELI project previously associated with the University of Michigan:
FRELI (the Free Repository of English Lexical Information) is an ongoing project whose principle product at this time is a list of approximately 36,000 English words with part of speech, frequency, and other information. [-- from the above URL]
The file contains not only words but also word frequency information (how commonly is the word used within the language?) and part-of-speech information. As such it is considerably more useful than a simple word list such as those at /homes/share/words-english or /usr/dict/words. The words in the list have been cross-checked against the 1911 Roget's thesaurus (from which the frequencies are derived) as well as the OED and other sources. The data are thus a bit more reliable than some of our other public domain sources.
A brief comparison of the three English word lists we have available is presented here:
$ wc $f $u $w
36531 147523 1084025 /homes/ddailey/public_html/full.txt
45402 45402 409048 /usr/dict/words
29141 29141 240169 /homes/share/words-english
111074 222066 1733242 total
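To put the per-line difference in numbers, the counts above can be divided through. A minimal sketch, assuming `$f`, `$u` and `$w` are shell variables holding the three paths in the order shown (the note does not show their definitions). `per_line` is a hypothetical helper name, and the sketch feeds the `wc` figures quoted above to `awk` rather than re-reading the files:

```shell
# Assumed (not shown in the note): f, u, w hold the three paths, e.g.
#   f=/homes/ddailey/public_html/full.txt
#   u=/usr/dict/words
#   w=/homes/share/words-english

# per_line: read "lines words bytes name" rows (wc output) on stdin and
# print the average words and bytes per line, skipping the "total" row.
per_line() {
    awk '$4 != "total" && $1 > 0 {
        printf "%s %.2f words/line %.2f bytes/line\n", $4, $2 / $1, $3 / $1
    }'
}

# The wc output quoted above, reproduced so the sketch runs without the files.
per_line <<'EOF'
36531 147523 1084025 /homes/ddailey/public_html/full.txt
45402 45402 409048 /usr/dict/words
29141 29141 240169 /homes/share/words-english
111074 222066 1733242 total
EOF
```

On the real files, `wc $f $u $w | per_line` should give the same rows. The FRELI file works out to about four fields per line, against exactly one for each of the plain lists.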
While /usr/dict/words may have more entries, the data in /homes/ddailey/public_html/full.txt has considerably more information per line. Representative samples of the data are presented below:
ddailey/public_html/full.txt:
$ tail -15000 $f|head -10
0 21361 newt (n)
1 21362 newtonian (adj)
10 252 next (adj,adv,prep)
1 21363 next-day (adj)
1 21364 next-door (adj)
0 21365 nexus (n)
1 21366 niaiserie (n)
4 21367 nib (n)
4 21368 nibble (n,v)
12 21369 nice (adj)

/usr/dict/words:
$ tail -15000 $u|head -10
pi
pianist
piano
pianos
pica
picas
Picasso
picayune
Piccadilly
piccolo

/homes/share/words-english:
$ tail -15000 $w|head -10
food
foodstuff
fool
foolhardy
foolish
foolproof
foolscap
foot
footage
football
The data in the first column of the FRELI list represent the word's frequency of occurrence in the FRELI corpus (Roget's 1911, as provided through the Gutenberg project).
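Because the fields are separated by whitespace, the list can be ranked or filtered with standard tools. A minimal sketch, assuming the four-field layout seen in the samples (frequency, a number this note does not explain, word, part-of-speech tags in parentheses). `top_words` and `words_with_pos` are hypothetical helper names, and the sample file holds lines quoted in this note:

```shell
# Small sample made of lines quoted in this note.
sample=$(mktemp)
cat > "$sample" <<'EOF'
0 21361 newt (n)
10 252 next (adj,adv,prep)
4 21367 nib (n)
4 21368 nibble (n,v)
12 21369 nice (adj)
117 3261 bear (n,v) ~
EOF

# top_words N FILE: the N most frequent words, as "frequency word";
# ties on frequency are broken alphabetically by word.
top_words() {
    sort -k1,1nr -k3,3 "$2" | head -n "$1" | awk '{ print $1, $3 }'
}

# words_with_pos TAG FILE: words whose tag list contains TAG exactly
# (so "adv" does not also match "adj"). Anything after field 4 is ignored.
words_with_pos() {
    awk -v tag="$1" '{
        tags = $4
        gsub(/[()]/, "", tags)          # "(n,v)" -> "n,v"
        n = split(tags, t, ",")
        for (i = 1; i <= n; i++)
            if (t[i] == tag) { print $3; break }
    }' "$2"
}

top_words 3 "$sample"        # three most frequent entries in the sample
words_with_pos v "$sample"   # entries tagged as verbs
```

Splitting the tag list rather than matching a substring is a deliberate choice here, since a plain `grep v` would also hit `adv`.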
A few remarks about the idiosyncrasies of the data:
117 3261 bear (n,v) ~
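Many entries in the other two lists begin with a capital letter:
$ grep -c '^[A-Z]' $u $w
/usr/dict/words:6784
/homes/share/words-english:5711
Those in /homes/ddailey/public_html/full.txt mostly do not; only 86 lines contain a capital letter anywhere:
$ grep '[A-Z]' /homes/ddailey/public_html/full.txt|wc -l
86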
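The two commands are not quite like-for-like: the first anchors the capital at the start of the line, whereas the FRELI lines shown in this note begin with a number, so the second looks for a capital anywhere on the line. A minimal sketch of a closer comparison, assuming the word is the third whitespace-separated field. `capitalized_words` is a hypothetical helper name, and the capitalized sample entries are invented for illustration, not taken from the file:

```shell
# capitalized_words FILE: print FRELI-style entries whose word (field 3)
# begins with A-Z. LC_ALL=C keeps the range A-Z byte-wise in every locale.
capitalized_words() {
    LC_ALL=C awk '$3 ~ /^[A-Z]/ { print $3 }' "$1"
}

# Sample in the FRELI layout. Only the "nice" line is quoted from this
# note; the other three entries are invented.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1 90001 Xmas (n)
12 21369 nice (adj)
1 90002 non-English (adj)
2 90003 Newtonian (adj)
EOF

capitalized_words "$sample"              # words that start with a capital
LC_ALL=C grep -c '[A-Z]' "$sample"       # capital anywhere on the line
```

The second count can exceed the first, because an entry such as the invented `non-English` has a capital only in mid-word.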
$ grep '[0-9]' $w |wc
157 157 1096
words with numbers in them

$ grep '^[^aeiouy]*$' $w |wc
370 370 1601
words with no (lowercase) vowels

$ grep '^[A-Z]*$' $w|wc
207 207 846
acronyms
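These counts overlap. In the C locale, any line made up solely of capitals also contains no lowercase vowel, so the all-capital entries are counted again among the vowel-less ones. Both of those patterns also match empty lines, since `*` allows zero characters. A minimal sketch that separates the categories, run on an invented sample list (the words are illustrative, not taken from /homes/share/words-english); `count` is a hypothetical helper name:

```shell
# Invented sample word list, one entry per line.
words=$(mktemp)
printf '%s\n' food 1st nth hmm NATO IBM Dr rhythm > "$words"

# count PATTERN FILE: number of lines matching a basic regular expression.
# LC_ALL=C makes ranges such as A-Z and b-d byte-wise.
count() {
    LC_ALL=C grep -c -- "$1" "$2"
}

count '[0-9]' "$words"            # entries containing a digit
count '^[^aeiouy]*$' "$words"     # no lowercase vowel (the note's pattern)
count '^[A-Z][A-Z]*$' "$words"    # all capitals, at least one letter
# Lowercase consonants only: vowel-less words that are not acronyms,
# abbreviations with capitals, or entries with digits.
count '^[b-df-hj-np-tv-xz][b-df-hj-np-tv-xz]*$' "$words"
```

Requiring at least one character (`[A-Z][A-Z]*` rather than `[A-Z]*`) keeps blank lines out of the tallies.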