English Gigaword language model training recipe

This is a recipe for training word n-gram language models on the newswire text in the English Gigaword corpus (about 1,200M words from the NYT, APW, AFE, and XIE sources). It also prepares the dictionaries needed to use the LMs with the HTK and CMU Sphinx speech recognizers.

Requirements:


By default, the scripts build models with interpolated modified Kneser-Ney smoothing, a bigram count cutoff of 3, a trigram count cutoff of 5, and an open vocabulary with an unknown-word token. I used the top 5K, 20K, and 64K most frequent words in the training text as vocabularies. Both verbalized-punctuation (VP) and non-verbalized-punctuation (NVP) LMs are built.
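
For concreteness, here is a minimal sketch of how one such model could be built, assuming the SRILM toolkit; the file names (train.txt, word.counts, wlist.64k, giga.64k.3gram.lm.gz) are placeholders, how the stated cutoffs map onto SRILM's -gtNmin options is my assumption, and the recipe's own scripts may differ in detail:

    # Count unigrams and keep the 64K most frequent words as the vocabulary
    ngram-count -text train.txt -order 1 -write word.counts
    sort -k2,2 -n -r word.counts | head -64000 | awk '{print $1}' > wlist.64k

    # Train a trigram LM with interpolated modified Kneser-Ney smoothing,
    # count cutoffs on bigrams and trigrams, and an open vocabulary whose
    # out-of-vocabulary words map to the unknown-word token <unk>
    ngram-count -order 3 -text train.txt -vocab wlist.64k \
        -interpolate -kndiscount \
        -gt2min 3 -gt3min 5 -unk \
        -lm giga.64k.3gram.lm.gz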

Here are OOV rates (OOV%) and perplexities (ppl) measured on three held-out evaluation sets:

  • giga - a held-out portion of the Gigaword training data (~25M words)
  • setasd - text from the CSR LM-1 setaside directory (~15M words)
  • csr - a held-out portion of the CSR LM-1 training data, from my CSR LM-1 training recipe (~6M words)

Note that the CSR test sets are from different newswire sources and cover a different historical period than Gigaword.

Vocab  Punc  Size    giga OOV%  giga ppl  setasd OOV%  setasd ppl  csr OOV%  csr ppl  Zip file
5K     NVP   2-gram  12.5       123.7     12.0         132.5       12.1      130.9    Download
20K    NVP   2-gram  4.1        199.7     3.7          215.9       3.8       213.9    Download
64K    NVP   2-gram  1.7        238.6     1.1          264.9       1.2       262.5    Download
5K     VP    2-gram  11.4       96.4      10.8         103.6       11.0      104.0    Download
20K    VP    2-gram  3.8        141.8     3.4          151.2       3.5       153.0    Download
64K    VP    2-gram  1.8        161.4     1.2          176.2       1.2       178.9    Download
5K     NVP   3-gram  12.5       82.0      12.0         91.1        12.1      89.7     Download
20K    NVP   3-gram  4.1        121.3     3.7          138.5       3.8       136.4    Download
64K    NVP   3-gram  1.7        144.6     1.1          170.1       1.2       167.5    Download
5K     VP    3-gram  11.4       62.0      10.8         67.8        11.0      67.9     Download
20K    VP    3-gram  3.8        79.3      3.4          88.2        3.5       88.7     Download
64K    VP    3-gram  1.8        90.3      1.2          103.1       1.2       104.1    Download
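
Assuming SRILM again, here is a sketch of how figures like these can be measured; giga.64k.3gram.lm.gz and test.txt are placeholder names:

    # Score a held-out test set; the output lists the number of OOV words
    # and the perplexity (without -unk, OOV words are excluded from the
    # logprob sum rather than scored via <unk>)
    ngram -order 3 -lm giga.64k.3gram.lm.gz -ppl test.txt

OOV% is then the reported OOV count divided by the total number of test-set words.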

Files:
  lm_giga_recipe.zip - Gigaword training recipe
  readme.txt - Readme file (contained in the above zip)