Reddit reviews Spoken Language Processing: A Guide to Theory, Algorithm and System Development

We found 2 Reddit comments about Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Here are the top ones, ranked by their Reddit score.



u/[deleted] · 3 points · r/programming

Also, this is the standard textbook on speech recognition.

To address the submitter's problem: if you want speech rec as a tool rather than a research problem, you're almost certainly going to be better off using Dragon or Microsoft than trying to train your own Sphinx/HTK system.

u/Megatron_McLargeHuge · 1 point · r/MachineLearning

There are a million details as others have said. You don't know how much you're missing.

This is the book to read for traditional HMM-based ASR.

Ignore the discussion of Baum-Welch. The HMM isn't trained in the usual way, since (1) it's huge and (2) there's limited data. The transition probabilities come from your language model. The usual HMM topology has three states per phone-in-context, with a dictionary of pronunciation variants for each word.
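A minimal sketch of that topology, assuming a toy lexicon and context-independent phones for simplicity. The lexicon entries, the state naming scheme, and the function names are all made up for illustration and are not taken from the book or any toolkit.

```python
# Hypothetical toy lexicon: each word maps to one or more pronunciation variants.
LEXICON = {
    "the": [["DH", "AH"], ["DH", "IY"]],
    "cat": [["K", "AE", "T"]],
}

STATES_PER_PHONE = 3  # left-to-right: roughly onset, middle, offset of the phone


def phone_states(phone):
    """Return illustrative names for the HMM states of one phone."""
    return [f"{phone}_{i}" for i in range(STATES_PER_PHONE)]


def word_state_sequences(word):
    """Expand every pronunciation variant of a word into its chain of HMM states."""
    return [
        [state for phone in variant for state in phone_states(phone)]
        for variant in LEXICON[word]
    ]


# Each variant of "the" becomes its own six-state chain.
for chain in word_state_sequences("the"):
    print(chain)
```

A word with two pronunciation variants yields two alternative state chains, and the decoder is free to follow either one.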

Each state has a GMM that models the likelihood of the features. The features are the MFCCs of a frame plus deltas and double deltas computed from the MFCCs of neighboring frames. You'll probably use a diagonal covariance matrix.
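As a rough sketch of the feature stacking and the per-state score, assuming the MFCCs are already computed and given as lists of floats. The deltas here are the simplest possible version (difference from the previous frame); toolkits typically use a regression over several neighboring frames instead. All function names are illustrative.

```python
import math


def deltas(frames):
    """First difference against the previous frame (the first frame gets zeros)."""
    out = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for c, p in zip(cur, prev)])
    return out


def add_deltas(mfccs):
    """Stack static, delta, and double-delta features for each frame."""
    d1 = deltas(mfccs)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(mfccs, d1, d2)]


def diag_gmm_loglik(x, weights, means, variances):
    """Log-likelihood of vector x under a GMM with diagonal covariances."""
    comp = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        # Diagonal covariance: the dimensions are independent, so the log
        # density is a sum of one-dimensional Gaussian log densities.
        for xi, m, v in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        comp.append(ll)
    top = max(comp)  # log-sum-exp over the mixture components
    return top + math.log(sum(math.exp(c - top) for c in comp))
```

The diagonal assumption is what keeps this cheap: each component needs one variance per feature dimension rather than a full matrix.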

Remember I said phone-in-context? That's because the actual pronunciation of a phoneme depends on the phonemes around it. You have to learn clusters of these since there are too many contexts to model separately.
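To make the idea concrete, here is a small sketch that labels each phone with its neighbors (written `left-phone+right`, the notation HTK uses) and then ties contexts together. The tying rule is a deliberately crude stand-in: real systems usually learn the clusters with phonetic decision trees, whereas this just coarsens the neighbors into hand-picked broad classes. The class table and function names are assumptions for illustration.

```python
def triphones(phones, boundary="sil"):
    """Label each phone with its left and right neighbors."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Hypothetical broad classes, used only to illustrate tying similar contexts.
BROAD_CLASS = {"K": "stop", "T": "stop", "AE": "vowel", "AH": "vowel", "sil": "sil"}


def tied_label(triphone):
    """Map a triphone to a cluster by coarsening its left and right context."""
    left, rest = triphone.split("-")
    center, right = rest.split("+")
    return f"{BROAD_CLASS[left]}-{center}+{BROAD_CLASS[right]}"


print(triphones(["K", "AE", "T"]))
print(tied_label("K-AE+T"), tied_label("T-AE+K"))
```

Here `K-AE+T` and `T-AE+K` land in the same cluster and would share one set of model parameters, which is how the number of distinct contexts is kept manageable.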

Training data: to train, you need alignments of words and their pronunciations to audio frames. This pretty much requires using an existing recognizer to do labeling for you. You give it a restricted language model to force it to recognize what was said and use the resulting alignment as training data.
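The alignment step can be sketched as a Viterbi pass over a fixed state chain. This is a simplification of what a real recognizer does with a restricted language model: it assumes the transcript has already been expanded into one left-to-right chain of states, that per-frame log-likelihoods are available as `loglik[t][s]`, that transition probabilities can be ignored, and that there are at least as many frames as states. The function name is made up.

```python
def forced_align(loglik, n_states):
    """Assign each frame to a state of a known left-to-right chain.

    At each frame the path either stays in its current state or advances to
    the next one. It must start in state 0 and end in the last state.
    """
    n_frames = len(loglik)
    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = loglik[0][0]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            advance = score[t - 1][s - 1] if s > 0 else neg_inf
            if advance > stay:
                score[t][s], back[t][s] = advance + loglik[t][s], s - 1
            else:
                score[t][s], back[t][s] = stay + loglik[t][s], s
    # Trace back from the final state to recover the frame-to-state alignment.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Four frames, two states: the first two frames fit state 0, the last two state 1.
print(forced_align([[0, -5], [0, -5], [-5, 0], [-5, 0]], 2))
```

The returned list gives a state label per frame, which is the kind of alignment that can then be used as training data.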

Extra considerations: how to model silence (voice activity detector); how to handle pauses and "ums" (voiced pauses); how to map non-verbatim transcripts to how they might have been spoken (how did he say 1024?); how to adapt to individual speakers; how to collapse states of the HMM into a lattice; and how to handle backoff from long n-grams to short ones in your language model.
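For the last item, here is a minimal sketch of backing off from long n-grams to short ones. It uses the "stupid backoff" scheme, chosen only because it is short: it returns unnormalized scores rather than probabilities, and ASR language models more commonly use properly smoothed schemes such as Katz or Kneser-Ney. The function names and the `alpha=0.4` penalty are assumptions for illustration.

```python
from collections import Counter


def count_ngrams(tokens, max_order):
    """Count all n-grams of length 1..max_order in a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def backoff_score(word, context, counts, total, alpha=0.4):
    """Score a word given its context, shortening the context when unseen.

    Each backoff step multiplies the score by alpha. The result is a
    relative score, not a normalized probability.
    """
    penalty = 1.0
    context = tuple(context)
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            return penalty * counts[ngram] / counts[context]
        context = context[1:]  # drop the oldest word and try again
        penalty *= alpha
    return penalty * counts[(word,)] / total  # fall through to the unigram


tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens, 3)
print(backoff_score("mat", ["sat", "the"], counts, len(tokens)))
```

In the usage line, the trigram "sat the mat" was never seen, so the score falls back to the bigram "the mat" with one penalty applied.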

Needless to say, I don't recommend this for a master's thesis.