English Gigaword Language Model Training Recipe
-----------------------------------------------
http://www.keithv.com/software/giga/

This is a recipe to train word n-gram language models using the newswire
text provided in the English Gigaword corpus.  It also prepares the
dictionaries needed to use the LMs with the HTK and Sphinx speech
recognizers.

After conditioning, the total word counts from each source were
(non-verbalized punctuation, excluding start/end words):

   NYT     642M
   APW     290M
   AFE     142M
   XIE     112M
   Total  1186M

About 50M words were held out from the training data to serve as eval and
dev test sets.

Requirements:
   Linux computer with Perl installed
   LDC English Gigaword corpus (only tested with the 1st edition)
   SRILM language modeling toolkit
   A fair bit of disk space and memory

Basic steps:
   1) Set GIGA_DATA to point to the directory in the Gigaword corpus above
      the afe, apw, nyt, and xie directories that contain the *.gz
      training files.

         GIGA_DATA=/rd/corpus/gigaword/;export GIGA_DATA

   2) Install the SRILM toolkit and make sure its binaries are on your
      path.

   3) Download the CMU dictionary and put it in the recipe directory under
      the name "c0.6".

   4) A full set of LMs using 5K, 20K, and 64K vocabularies, with both
      verbalized (VP) and non-verbalized punctuation (NVP), can be built
      by running "go_all.sh".

Defaults used by the script to build the LMs:
   * Interpolated, modified Kneser-Ney smoothing
   * 2-gram cutoff 3, 3-gram cutoff 5
   * Built with an unknown word
   * Both verbalized punctuation (VP) and non-verbalized punctuation (NVP)
     LMs are built
A rough sketch of the corresponding SRILM commands appears just before the
results tables below.

Note that the text conditioning steps are particularly time consuming;
plan on them taking a few days.  I did the best I could to clean up the
data, removing garbage, redundant articles, cricket scores, etc.  I used
some of the scripts/programs provided in the CSR LM-1 corpus written by
MIT/LDC/David Graff, as well as some stuff I cooked up myself.

Results:

Here are some OOV% and perplexity results measured on three different
held-out evaluation test sets:

   * giga   - held-out portion of the Gigaword training data (~25M words)
   * setasd - text from the CSR LM-1 setaside directory (~15M words)
   * csr    - held-out portion of the CSR LM-1 training data from my
              CSR LM-1 training recipe (~6M words)

Note that the CSR test sets are from different newswire sources and cover
a different historical period than Gigaword.
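As a point of reference, here is roughly what the defaults above look like
as SRILM commands, along with the kind of "ngram -ppl" call that produces
OOV% and perplexity numbers like those in the tables below.  This is a
sketch only: the actual commands live in go_all.sh, and the file names
(train_nvp.txt, wlist_64k_nvp.txt, lm_giga_64k_nvp_3gram.arpa.gz,
eval_giga.txt) are just illustrative.

   # Build an interpolated, modified Kneser-Ney 3-gram LM with an unknown
   # word, a 2-gram cutoff of 3, and a 3-gram cutoff of 5 (sketch only):
   ngram-count -order 3 -text train_nvp.txt -vocab wlist_64k_nvp.txt -unk \
               -interpolate -kndiscount -gt2min 3 -gt3min 5 \
               -lm lm_giga_64k_nvp_3gram.arpa.gz

   # Measure OOV rate and perplexity on a held-out test set:
   ngram -order 3 -lm lm_giga_64k_nvp_3gram.arpa.gz -unk -ppl eval_giga.txt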
+-------+------+--------+------+-------+-------+-------+-------+-------+
| Vocab | Punc | Size   | giga | giga  | setasd| setasd| csr   | csr   |
|       |      |        | OOV% | ppl   | OOV%  | ppl   | OOV%  | ppl   |
+-------+------+--------+------+-------+-------+-------+-------+-------+
|  5K   | NVP  | 2-gram | 12.5 | 123.7 |  12.0 | 132.5 |  12.1 | 130.9 |
|  20K  | NVP  | 2-gram |  4.1 | 199.7 |   3.7 | 215.9 |   3.8 | 213.9 |
|  64K  | NVP  | 2-gram |  1.7 | 238.6 |   1.1 | 264.9 |   1.2 | 262.5 |
+-------+------+--------+------+-------+-------+-------+-------+-------+
|  5K   | VP   | 2-gram | 11.4 |  96.4 |  10.8 | 103.6 |  11.0 | 104.0 |
|  20K  | VP   | 2-gram |  3.8 | 141.8 |   3.4 | 151.2 |   3.5 | 153.0 |
|  64K  | VP   | 2-gram |  1.8 | 161.4 |   1.2 | 176.2 |   1.2 | 178.9 |
+-------+------+--------+------+-------+-------+-------+-------+-------+
|  5K   | NVP  | 3-gram | 12.5 |  82.0 |  12.0 |  91.1 |  12.1 |  89.7 |
|  20K  | NVP  | 3-gram |  4.1 | 121.3 |   3.7 | 138.5 |   3.8 | 136.4 |
|  64K  | NVP  | 3-gram |  1.7 | 144.6 |   1.1 | 170.1 |   1.2 | 167.5 |
+-------+------+--------+------+-------+-------+-------+-------+-------+
|  5K   | VP   | 3-gram | 11.4 |  62.0 |  10.8 |  67.8 |  11.0 |  67.9 |
|  20K  | VP   | 3-gram |  3.8 |  79.3 |   3.4 |  88.2 |   3.5 |  88.7 |
|  64K  | VP   | 3-gram |  1.8 |  90.3 |   1.2 | 103.1 |   1.2 | 104.1 |
+-------+------+--------+------+-------+-------+-------+-------+-------+

The above used a bigram cutoff of 3 and a trigram cutoff of 5.  Here are
some results using NVP LMs with lower cutoffs of 1 and 3:

+-------+------+--------+------+-------+-------+-------+-------+-------+
| Vocab | Punc | Size   | giga | giga  | setasd| setasd| csr   | csr   |
|       |      |        | OOV% | ppl   | OOV%  | ppl   | OOV%  | ppl   |
+-------+------+--------+------+-------+-------+-------+-------+-------+
|  5K   | NVP  | 2-gram | 12.5 | 114.9 |  12.0 | 123.7 |  12.1 | 122.1 |
|  20K  | NVP  | 2-gram |  4.1 | 194.1 |   3.7 | 210.4 |   3.8 | 208.5 |
|  64K  | NVP  | 2-gram |  1.7 | 233.1 |   1.1 | 259.7 |   1.2 | 257.3 |
+-------+------+--------+------+-------+-------+-------+-------+-------+
|  5K   | NVP  | 3-gram | 12.5 |  72.7 |  12.0 |  81.4 |  12.1 |  80.0 |
|  20K  | NVP  | 3-gram |  4.1 | 113.5 |   3.7 | 130.5 |   3.8 | 128.5 |
|  64K  | NVP  | 3-gram |  1.7 | 134.8 |   1.1 | 159.9 |   1.2 | 157.2 |
+-------+------+--------+------+-------+-------+-------+-------+-------+

Notes:

The scripts use 293 out of the 314 original text data files for training,
7 (2 APW, 2 XIE, 2 NYT, 1 AFE) for development test data, and
7 (2 APW, 2 XIE, 2 NYT, 1 AFE) for evaluation test data.  I didn't use
the development set for anything.

The SRI toolkit doesn't output n-grams in the order required by Sphinx's
lm3g2dmp utility.  You'll need to re-sort the LMs if you want to use them
with the Sphinx decoder (a rough sketch of one way to do this appears at
the end of this file).

Have fun!

Keith Vertanen

Revision history:
-----------------
July 6th, 2007  - Initial release of Gigaword training recipe.
Sept 15th, 2007 - Sorted LMs in pre-built LMs.
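Re-sorting for Sphinx:

As mentioned in the notes above, SRILM's ARPA output has to be re-sorted
before Sphinx's lm3g2dmp will accept it.  Below is a rough sketch of one
possible starting point (not part of the recipe): it splits the ARPA file
into its sections and sorts the entries of each \N-grams: section on
everything after the leading log-probability column.  I haven't verified
that this is exactly the order lm3g2dmp expects, so check it against the
Sphinx documentation before relying on it.

   #!/bin/sh
   # sort_arpa.sh IN.arpa OUT.arpa -- hypothetical helper, not part of the
   # recipe.  Sorts the entries of each \N-grams: section of an ARPA LM.
   set -e
   in=$1; out=$2
   tmp=$(mktemp -d)

   # Split the file into one piece per section; every section starts with
   # a line beginning with a backslash (\data\, \1-grams:, ..., \end\).
   csplit -s -z -f "$tmp/piece." "$in" '/^\\/' '{*}'

   : > "$out"
   for p in "$tmp"/piece.*; do
       header=$(head -n 1 "$p")
       case "$header" in
           \\*-grams:)
               # Keep the header, sort the entries on everything after the
               # leading log-probability field, restore the blank line.
               printf '%s\n' "$header" >> "$out"
               tail -n +2 "$p" | sed '/^$/d' | LC_ALL=C sort -k2 >> "$out"
               printf '\n' >> "$out"
               ;;
           *)
               # The \data\ block and \end\ pass through untouched.
               cat "$p" >> "$out"
               ;;
       esac
   done
   rm -rf "$tmp"

Usage would be something like "sh sort_arpa.sh lm_in.arpa lm_sorted.arpa",
then feeding lm_sorted.arpa to lm3g2dmp.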