When we come to the recognition phase it is assumed that trained HMMs for each of the speech units in the vocabulary are available. The task is to find the underlying speech unit sequence, given an observation sequence corresponding to an unknown sentence. Mathematically this operation can be expressed as,
$$\hat{W} \;=\; \arg\max_{W} P(W \mid O) \;=\; \arg\max_{W} P(O \mid W)\, P(W),$$
where $W = w_1 w_2 \cdots w_S$ is an arbitrary speech unit sequence of arbitrary length $S$ and $O$ is the observation sequence. Since $P(W)$ is provided by the statistical language model, the only thing that has to be done in order to find $\hat{W}$ is to calculate $P(O \mid W)$ for every possible $W$. It is obvious that this procedure is computationally very expensive, because there can be a very large number of sentences, even for a small vocabulary. A cheaper solution is to approximate the procedure by finding the most likely state sequence $\hat{q}$ in the language model $\lambda$, instead of the speech unit sequence $\hat{W}$. Formally,
$$\hat{q} \;=\; \arg\max_{q} P(O, q \mid \lambda).$$
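To make the nature of this approximation explicit, note that the exact sentence likelihood sums over all state sequences, while the Viterbi score keeps only the best one. Writing $\lambda_W$ for the composite model obtained by concatenating the unit models of $W$ (notation introduced here for illustration),
$$P(O \mid W) \;=\; \sum_{q} P(O, q \mid \lambda_W) \;\ge\; \max_{q} P(O, q \mid \lambda_W),$$
so replacing the sum by the maximum is exact only when a single state sequence dominates the likelihood; this is the source of the suboptimality discussed below.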
Then it is possible to trace the corresponding speech unit sequence via the state sequence. In order to calculate $\max_{q} P(O, q \mid \lambda)$ we can use the Viterbi algorithm directly, or the method called level building, a variant of the Viterbi algorithm. Since Viterbi-based recognition is suboptimal unless each speech unit corresponds to a single HMM state, some attempts have been made to develop efficient methods for calculating the sentence likelihoods. The so-called N-best algorithm is one of these.
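As an illustration of the first option, a minimal Viterbi decoding sketch over a composite recognition HMM is given below. It is written with NumPy; the function name, argument layout, and the state-to-unit mapping mentioned afterwards are illustrative assumptions rather than anything prescribed by the text.

    import numpy as np

    def viterbi(log_A, log_B, log_pi):
        # log_A:  (N, N) log transition probabilities of the composite HMM
        # log_B:  (T, N) log emission probability of each observation in each state
        # log_pi: (N,)   log initial state probabilities
        # Returns the best log score and the most likely state sequence q (length T).
        T, N = log_B.shape
        delta = np.full((T, N), -np.inf)    # best partial log score ending in state j at time t
        psi = np.zeros((T, N), dtype=int)   # back-pointers
        delta[0] = log_pi + log_B[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_A      # scores[i, j]: arrive in j from i
            psi[t] = np.argmax(scores, axis=0)          # best predecessor of each state j
            delta[t] = scores[psi[t], np.arange(N)] + log_B[t]
        q = np.zeros(T, dtype=int)
        q[-1] = int(np.argmax(delta[-1]))
        for t in range(T - 2, -1, -1):                  # trace back through the back-pointers
            q[t] = psi[t + 1, q[t + 1]]
        return float(delta[-1, q[-1]]), q

Once $\hat{q}$ is available, the speech unit sequence is read off by mapping each state index back to the unit model it belongs to in the composite network, which is the tracing step mentioned above.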
There is another problem associated with continuous recognition, which does not arise in connection with isolated recognition. Due to the complicated and approximate recognition (decoding) procedures in continuous mode, mismatches can arise between these procedures and the training procedures. For example, in MMI training we try to maximize the probability of the correct sentence against the alternative sentences. But if we use the Viterbi algorithm for decoding in the recognition phase, there will be a mismatch, because it gives the optimum state sequence and not the optimum sentence. In order to reduce such mismatches, several modifications to the basic MMI training criterion have been suggested [, ]. One such training criterion is the so-called embedded Viterbi training, where at each time $t$ the probability of the correct state against the alternative states is maximized. Another suggestion is to maximize the probability of the correct state sequence against the alternative sentences; this training method is consistent with Viterbi decoding. Finally, it is worth mentioning that decoding based on the N-best algorithm is more consistent with MMI training than Viterbi-based decoding.
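For concreteness, the criteria discussed above can be sketched as follows. The notation ($W_c$ for the correct transcription, $\lambda_W$ for the composite model of sentence $W$) is introduced here for illustration, and the exact formulations used in the cited works may differ. The basic sentence-level MMI criterion is
$$F_{\mathrm{MMI}} \;=\; \log \frac{P(O \mid \lambda_{W_c})\, P(W_c)}{\sum_{W} P(O \mid \lambda_{W})\, P(W)},$$
while the Viterbi-consistent variant replaces each sentence likelihood by the score of its best state sequence,
$$F \;=\; \log \frac{\max_{q} P(O, q \mid \lambda_{W_c})\, P(W_c)}{\sum_{W} \max_{q} P(O, q \mid \lambda_{W})\, P(W)},$$
so that training and Viterbi decoding optimize the same quantity.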