Big English Word Lists

I created a set of large English word lists by combining 12 different source word lists and keeping the words that appear in some minimum number of them. I used the following sources:

  • British National Corpus (336K words)
  • Enron email corpus (135K words)
  • Moby word list (355K words)
  • CMU pronunciation dictionary (119K words)
  • W3C email corpus (89K words)
  • Wiktionary (218K words)
  • Wikipedia (top 400K words)
  • Gigaword newswire corpus (top 400K words)
  • LM-CSR newswire corpus (top 400K words)
  • Google corpus (top 400K words)
  • Westbury Lab Usenet corpus (top 400K words)
  • ICWSM 2009 blog corpus (top 400K words)

By varying the number of lists a word must appear in (from 1 to 12), I got word lists of varying size and "quality".
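
The thresholding step can be sketched roughly as follows. This is a minimal illustration, not the actual script used to build the lists. The function names and the tiny sample sources are hypothetical, and it assumes that "words in N lists" means words appearing in at least N of the sources, which is consistent with the list sizes growing as N falls.

```python
from collections import Counter


def count_list_membership(word_lists):
    """Map each word to the number of source lists that contain it."""
    counts = Counter()
    for words in word_lists:
        # set() so a word repeated within one source is counted only once
        counts.update(set(words))
    return counts


def words_in_at_least(counts, min_lists):
    """Return the sorted words that appear in at least `min_lists` sources.

    Assumption: the threshold is "at least", not "exactly".
    """
    return sorted(w for w, n in counts.items() if n >= min_lists)


# Three tiny stand-in sources (the real ones hold 89K to 400K words each).
sources = [
    ["the", "and", "cat", "zyzzyva"],
    ["the", "and", "cat"],
    ["the", "and", "teh"],
]
counts = count_list_membership(sources)

# Lowering the threshold admits more words, including rarer or noisier ones.
for k in range(len(sources), 0, -1):
    print(k, words_in_at_least(counts, k))
```

A high threshold keeps only words that every source agrees on, while a threshold of 1 keeps everything, including typos such as "teh" that occur in a single source.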

Update: In March 2018 I updated the word lists. Previously I used 10 word lists, but several had problems that kept some common words like "and", as well as words with apostrophes, from appearing in the intersections involving 9 or 10 of the lists. In the process of fixing this, I removed the American National Corpus and the 20 Newsgroups word lists, and I added new word lists built from blog, Usenet, W3C, and Wiktionary data. If you need the old lists for some reason, they are still available here.

Files:
wlist_all.zip All the word lists
wlist_match12.zip Words in 12 lists (27K words)
wlist_match11.zip Words in 11 lists (43K words)
wlist_match10.zip Words in 10 lists (60K words)
wlist_match9.zip Words in 9 lists (84K words)
wlist_match8.zip Words in 8 lists (111K words)
wlist_match7.zip Words in 7 lists (143K words)
wlist_match6.zip Words in 6 lists (181K words)
wlist_match5.zip Words in 5 lists (228K words)
wlist_match4.zip Words in 4 lists (289K words)
wlist_match3.zip Words in 3 lists (384K words)
wlist_match2.zip Words in 2 lists (587K words)
wlist_match1.zip Words in 1 list (1517K words)