Discussion on speech recognition on the Nokia N800

Discussion

I investigated real-time speech recognition on the Nokia N800. I found that adapting the acoustic model to the speaker and microphone was very beneficial, and the majority of the benefit was gained in the first 30 adaptation utterances. But despite the gains offered by adaptation, error rates were still high on my test sets (newswire 30%, email 28%, SMS 48%). I found I could significantly improve accuracy by searching harder, using wideband audio, using a continuous acoustic model, and using a longer-span language model. But none of these options is currently realistic given the resource and device constraints of the N800.

The test sets I used were quite hard. The email and SMS test sets in particular probably suffer from being out-of-domain relative to the language model's newswire training data. It is possible that with a large amount of email- or SMS-like training data, the error rates would be more manageable. In addition, the newswire and email test sets were mismatched with the language model's training data, which used verbalized punctuation and multi-sentence blocks.

I found the error rates were simply too high to field short message dictation on the N800. The N800's processor and memory constraints rule out some significant accuracy improvements. In addition, the N800's limitation of capturing only narrowband audio hurt accuracy. So instead of short message dictation, I opted for an easier 5K closed-vocabulary newswire dictation task. On this task I could obtain acceptable error levels (9%) while performing real-time recognition on the N800.

I described how a word confusion network could be used to provide likely word alternates to the best words in the recognizer's result. I showed that by allowing users to select from a small number of alternates for every word in the best result, a substantial number of errors could be corrected. Further, I showed that allowing deletion of words and copying words between different parts of the confusion network allowed even more errors to be corrected.
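To make the idea concrete, here is a minimal sketch of a confusion network and of counting how many 1-best errors a user could fix by picking from a limited number of alternates. It is an illustration only, not the code used in this work: the data layout, the function names, the empty-string deletion marker, and the assumption that the reference transcript is already aligned one token per slot are all mine.

```python
from typing import List, Tuple

# Assumed layout: each slot holds (word, posterior) pairs, best first.
Slot = List[Tuple[str, float]]

DELETE = ""  # hypothetical marker meaning "no word in this slot"


def best_path(network: List[Slot]) -> List[str]:
    """Return the 1-best words, skipping slots whose best entry is a deletion."""
    return [slot[0][0] for slot in network if slot[0][0] != DELETE]


def correctable_errors(network: List[Slot],
                       aligned_reference: List[str],
                       max_alternates: int) -> Tuple[int, int]:
    """Count 1-best errors and how many are fixable from the alternates.

    Assumes aligned_reference has exactly one entry per slot
    (DELETE where the reference has no word for that slot).
    """
    errors = 0
    fixable = 0
    for slot, ref in zip(network, aligned_reference):
        if slot[0][0] == ref:
            continue  # 1-best already correct
        errors += 1
        # Alternates are the entries after the best one, limited in number.
        choices = [word for word, _ in slot[1:1 + max_alternates]]
        if ref in choices:
            fixable += 1
    return errors, fixable


# Toy network with made-up words and posteriors.
network = [
    [("the", 0.9), ("a", 0.1)],
    [("cat", 0.5), ("cap", 0.3), ("cut", 0.2)],
    [("uh", 0.6), (DELETE, 0.4)],
    [("sat", 0.7), ("sad", 0.3)],
]
reference = ["the", "cap", DELETE, "sat"]

print(best_path(network))
print(correctable_errors(network, reference, max_alternates=1))
```

In this toy case both errors are fixable with a single alternate per slot, one by substitution and one by choosing the deletion entry. Copying words between slots, as described above, is not modeled here.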

I built a tablet-based user interface allowing review of the confusion network and allowing correction of errors. The interface allowed the user to select between alternates by tapping or by stroking through words. The UI allowed words to be copied between different areas of the confusion network. In addition, the UI supported arbitrary corrections using the standard on-screen keyboard.

Recommendations

Wideband audio capture - I found a large 21% relative improvement in word error rate (WER) using 16 kHz audio as opposed to 8 kHz. This improvement is available with no significant increase in speech decoding time. If recognition is occurring locally on the device, the device shouldn't be throwing away valuable information.
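For readers unfamiliar with the metric, the sketch below shows one common way to compute WER (word-level edit distance divided by reference length) and what "relative improvement" means. The function names are my own, and the numbers in the usage lines are made up to show the arithmetic; they are not results from my experiments.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Standard dynamic-programming edit distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error rate that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only.
print(word_error_rate("the cat sat on the mat", "the cat sat mat"))
print(relative_improvement(0.40, 0.30))
```

A relative improvement is measured against the baseline error rate, so a 21% relative improvement means the 16 kHz system made about a fifth fewer errors than the 8 kHz one, not that WER dropped by 21 points.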

Investigate Bluetooth audio glitch - I discovered that every time I started recording on my Bluetooth headset after a period of inactivity, the N800 or the mic was generating a glitch in the audio stream. This caused significant problems if the glitch was not removed prior to recognition. It would be good to understand what is actually going on here rather than just dropping the first part of the audio stream.
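The workaround of dropping the start of the stream can be sketched as follows. This is a rough illustration rather than the actual code: the function name is hypothetical, the audio is assumed to be a plain list of PCM samples, and the skip duration in the example is an arbitrary value, not a measured glitch length.

```python
from typing import List


def drop_leading_audio(samples: List[int],
                       sample_rate_hz: int,
                       skip_ms: int) -> List[int]:
    """Discard the first skip_ms milliseconds of a mono PCM sample list."""
    skip_samples = sample_rate_hz * skip_ms // 1000
    return samples[skip_samples:]


# Example: two seconds of placeholder audio at 8 kHz, dropping an assumed 250 ms.
recording = list(range(16000))
trimmed = drop_leading_audio(recording, sample_rate_hz=8000, skip_ms=250)
print(len(trimmed))
```

A fixed skip like this also throws away any speech that starts immediately, which is one reason understanding the underlying cause would be preferable.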

Better language models - I suspect recognition would be much better on my email and SMS test sets if I had an appropriately trained language model for the recognition domain. The Enron corpus perhaps could serve as a large enough training set for email. Publicly available SMS corpora are small and limited in quality.

Faster decoding - I had to leave fairly large chunks of recognition accuracy on the table to achieve real-time performance on the N800. A device even a few times faster would allow search errors to be reduced and allow more complex acoustic modeling to be used. Improvements in the efficiency of the PocketSphinx decoder would also help.

More memory - On my test sets, I had out-of-vocabulary (OOV) rates of between 2% and 6%. This OOV rate and the resulting recognition errors could be reduced if the recognizer used a larger vocabulary language model. This would require more memory. It might also be possible to increase vocabulary size without increasing memory footprint through more judicious pruning of the language model.
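As a reminder of how the OOV rate is measured, here is a small sketch that counts the fraction of test-set tokens missing from the recognizer's vocabulary. The function name, the lowercasing step, and the toy vocabulary and sentence are assumptions for illustration, not my actual test data.

```python
from typing import Iterable, Set


def oov_rate(tokens: Iterable[str], vocabulary: Set[str]) -> float:
    """Fraction of running words in the test text that the vocabulary lacks."""
    # Lowercasing is an assumption; real vocabularies may be case-sensitive.
    token_list = [t.lower() for t in tokens]
    if not token_list:
        raise ValueError("no tokens to score")
    missing = sum(1 for t in token_list if t not in vocabulary)
    return missing / len(token_list)


# Toy example with a made-up vocabulary and message.
vocabulary = {"send", "the", "memo", "to"}
message = "send the memo to bob".split()
print(oov_rate(message, vocabulary))
```

Every OOV word is guaranteed to be misrecognized, and it often damages neighboring words as well, so a larger vocabulary lowers the floor on achievable error rate.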