English Gigaword Language Model Training Recipe
-----------------------------------------------
http://www.keithv.com/software/giga/

This is a recipe to train word n-gram language models using the newswire
text provided in the English Gigaword corpus.  It also prepares the
dictionaries needed to use the LMs with the HTK and Sphinx speech
recognizers.

After conditioning, the total word counts from each source were
(non-verbalized punctuation, excluding start/end words):

   NYT     642M
   APW     290M
   AFE     142M
   XIE     112M
   Total  1186M

About 50M words were held out from the training data to serve as eval and
dev test sets.

Requirements:
   Linux computer with Perl installed
   LDC English Gigaword corpus (only tested with the 1st edition)
   SRILM language modeling toolkit
   A fair bit of disk space and memory

Basic steps:
   1) Set GIGA_DATA to point to the directory in the Gigaword corpus above
      the afe, apw, nyt, and xie directories that contain the *.gz
      training files.

         GIGA_DATA=/rd/corpus/gigaword/;export GIGA_DATA

   2) Install the SRILM toolkit and make sure its binaries are on your
      path.

   3) Download the CMU dictionary and put it in the recipe directory under
      the name "c0.6".

   4) A full set of LMs using 5K, 20K, and 64K vocabularies, with both
      verbalized (VP) and non-verbalized punctuation (NVP), can be built
      by running "go_all.sh".

Defaults used by the script to build the LMs:
   * Interpolated, modified Kneser-Ney smoothing
   * 2-gram cutoff 3, 3-gram cutoff 5
   * Built with an unknown word
   * Both verbalized punctuation (VP) and non-verbalized punctuation (NVP)
     LMs are built
A rough sketch of the corresponding SRILM commands appears just before the
results tables below.

Note that the text conditioning steps are particularly time consuming;
plan on them taking a few days.  I did the best I could to clean up the
data, removing garbage, redundant articles, cricket scores, etc.  I used
some of the scripts/programs provided in the CSR LM-1 corpus written by
MIT/LDC/David Graff, as well as some stuff I cooked up myself.

Results:

Here are some OOV% and perplexity results measured on three different
held-out evaluation test sets:

   * giga   - held-out portion of the Gigaword training data (~25M words)
   * setasd - text from the CSR LM-1 setaside directory (~15M words)
   * csr    - held-out portion of the CSR LM-1 training data from my
              CSR LM-1 training recipe (~6M words)

Note that the CSR test sets are from different newswire sources and cover
a different historical period than Gigaword.
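As a point of reference, here is roughly what the defaults above look like
as SRILM commands, along with the kind of "ngram -ppl" call that produces
OOV% and perplexity numbers like those in the tables below.  This is a
sketch only: the actual commands live in go_all.sh, and the file names
(train_nvp.txt, wlist_64k_nvp.txt, lm_giga_64k_nvp_3gram.arpa.gz,
eval_giga.txt) are just illustrative.

   # Build an interpolated, modified Kneser-Ney 3-gram LM with an unknown
   # word, a 2-gram cutoff of 3, and a 3-gram cutoff of 5 (sketch only):
   ngram-count -order 3 -text train_nvp.txt -vocab wlist_64k_nvp.txt -unk \
               -interpolate -kndiscount -gt2min 3 -gt3min 5 \
               -lm lm_giga_64k_nvp_3gram.arpa.gz

   # Measure OOV rate and perplexity on a held-out test set:
   ngram -order 3 -lm lm_giga_64k_nvp_3gram.arpa.gz -unk -ppl eval_giga.txt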
+-------+------+--------+------+-------+-------+-------+-------+-------+
| Vocab | Punc | Size   | giga | giga  | setasd| setasd| csr   | csr   |
|       |      |        | OOV% | ppl   | OOV%  | ppl   | OOV%  | ppl   |
+-------+------+--------+------+-------+-------+-------+-------+-------+
|  5K   | NVP  | 2-gram | 12.5 | 123.7 |  12.0 | 132.5 |  12.1 | 130.9 |
|  20K  | NVP  | 2-gram |  4.1 | 199.7 |   3.7 | 215.9 |   3.8 | 213.9 |
|  64K  | NVP  | 2-gram |  1.7 | 238.6 |   1.1 | 264.9 |   1.2 | 262.5 |
+-------+------+--------+------+-------+-------+-------+-------+-------+
|  5K   | VP   | 2-gram | 11.4 |  96.4 |  10.8 | 103.6 |  11.0 | 104.0 |
|  20K  | VP   | 2-gram |  3.8 | 141.8 |   3.4 | 151.2 |   3.5 | 153.0 |
|  64K  | VP   | 2-gram |  1.8 | 161.4 |   1.2 | 176.2 |   1.2 | 178.9 |
+-------+------+--------+------+-------+-------+-------+-------+-------+
|  5K   | NVP  | 3-gram | 12.5 |  82.0 |  12.0 |  91.1 |  12.1 |  89.7 |
|  20K  | NVP  | 3-gram |  4.1 | 121.3 |   3.7 | 138.5 |   3.8 | 136.4 |
|  64K  | NVP  | 3-gram |  1.7 | 144.6 |   1.1 | 170.1 |   1.2 | 167.5 |
+-------+------+--------+------+-------+-------+-------+-------+-------+
|  5K   | VP   | 3-gram | 11.4 |  62.0 |  10.8 |  67.8 |  11.0 |  67.9 |
|  20K  | VP   | 3-gram |  3.8 |  79.3 |   3.4 |  88.2 |   3.5 |  88.7 |
|  64K  | VP   | 3-gram |  1.8 |  90.3 |   1.2 | 103.1 |   1.2 | 104.1 |
+-------+------+--------+------+-------+-------+-------+-------+-------+

The above used a bigram cutoff of 3 and a trigram cutoff of 5.  Here are
some results using NVP LMs with lower cutoffs of 1 and 3:

+-------+------+--------+------+-------+-------+-------+-------+-------+
| Vocab | Punc | Size   | giga | giga  | setasd| setasd| csr   | csr   |
|       |      |        | OOV% | ppl   | OOV%  | ppl   | OOV%  | ppl   |
+-------+------+--------+------+-------+-------+-------+-------+-------+
|  5K   | NVP  | 2-gram | 12.5 | 114.9 |  12.0 | 123.7 |  12.1 | 122.1 |
|  20K  | NVP  | 2-gram |  4.1 | 194.1 |   3.7 | 210.4 |   3.8 | 208.5 |
|  64K  | NVP  | 2-gram |  1.7 | 233.1 |   1.1 | 259.7 |   1.2 | 257.3 |
+-------+------+--------+------+-------+-------+-------+-------+-------+
|  5K   | NVP  | 3-gram | 12.5 |  72.7 |  12.0 |  81.4 |  12.1 |  80.0 |
|  20K  | NVP  | 3-gram |  4.1 | 113.5 |   3.7 | 130.5 |   3.8 | 128.5 |
|  64K  | NVP  | 3-gram |  1.7 | 134.8 |   1.1 | 159.9 |   1.2 | 157.2 |
+-------+------+--------+------+-------+-------+-------+-------+-------+

Notes:

The scripts use 293 out of the 314 original text data files for training,
7 (2 APW, 2 XIE, 2 NYT, 1 AFE) for development test data, and
7 (2 APW, 2 XIE, 2 NYT, 1 AFE) for evaluation test data.  I didn't use
the development set for anything.

The SRI toolkit doesn't output n-grams in the order required by Sphinx's
lm3g2dmp utility.  You'll need to re-sort the LMs if you want to use them
with the Sphinx decoder (a rough sketch of one way to do this appears at
the end of this file).

Have fun!

Keith Vertanen

Revision history:
-----------------
July 6th, 2007  - Initial release of Gigaword training recipe.
Sept 15th, 2007 - Sorted LMs in pre-built LMs.
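Re-sorting for Sphinx:

As mentioned in the notes above, SRILM's ARPA output has to be re-sorted
before Sphinx's lm3g2dmp will accept it.  Below is a rough sketch of one
possible starting point (not part of the recipe): it splits the ARPA file
into its sections and sorts the entries of each \N-grams: section on
everything after the leading log-probability column.  I haven't verified
that this is exactly the order lm3g2dmp expects, so check it against the
Sphinx documentation before relying on it.

   #!/bin/sh
   # sort_arpa.sh IN.arpa OUT.arpa -- hypothetical helper, not part of the
   # recipe.  Sorts the entries of each \N-grams: section of an ARPA LM.
   set -e
   in=$1; out=$2
   tmp=$(mktemp -d)

   # Split the file into one piece per section; every section starts with
   # a line beginning with a backslash (\data\, \1-grams:, ..., \end\).
   csplit -s -z -f "$tmp/piece." "$in" '/^\\/' '{*}'

   : > "$out"
   for p in "$tmp"/piece.*; do
       header=$(head -n 1 "$p")
       case "$header" in
           \\*-grams:)
               # Keep the header, sort the entries on everything after the
               # leading log-probability field, restore the blank line.
               printf '%s\n' "$header" >> "$out"
               tail -n +2 "$p" | sed '/^$/d' | LC_ALL=C sort -k2 >> "$out"
               printf '\n' >> "$out"
               ;;
           *)
               # The \data\ block and \end\ pass through untouched.
               cat "$p" >> "$out"
               ;;
       esac
   done
   rm -rf "$tmp"

Usage would be something like "sh sort_arpa.sh lm_in.arpa lm_sorted.arpa",
then feeding lm_sorted.arpa to lm3g2dmp.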