
Maximum Mutual Information (MMI) criterion

In ML we optimize the HMM of only one class at a time, without touching the HMMs of the other classes. This procedure does not involve the concept of ``discrimination'', which is of great interest in Pattern Recognition and of course in ASR. The ML learning procedure therefore gives the HMM system poor discrimination ability, especially when the parameters estimated in the training phase do not match the speech inputs used in the recognition phase. Such mismatches can arise for two reasons: either the training and recognition data have considerably different statistical properties, or reliable parameter estimates are difficult to obtain during training.
The MMI criterion, on the other hand, considers the HMMs of all the classes simultaneously during training. Parameters of the correct model are updated to enhance its contribution to the observations, while parameters of the alternative models are updated to reduce their contributions. This procedure gives the system a high discriminative ability, and thus MMI belongs to the so-called ``discriminative training'' category.
In order to have a closer look at the MMI criterion, consider a set of HMMs

$$ \Lambda = \{ \lambda_w \}, \qquad \mbox{one HMM } \lambda_w \mbox{ per class } w. $$

The task is to minimize the conditional uncertainty of a class $v$ of utterances, given an observation sequence $O$ of that class. This is equivalent to minimizing the conditional information

$$ I(v \mid O, \Lambda) = -\log P(v \mid O, \Lambda) \qquad (1.31) $$

wrt $\Lambda$.

In an information theoretic framework this leads to the minimization of the conditional entropy, defined as the expectation ($E$) of the conditional information $I$,

$$ H(V \mid {\bf O}) = E\{\, I(v \mid O, \Lambda) \,\} = -E\{\, \log P(v \mid O, \Lambda) \,\} \qquad (1.32) $$

where $V$ represents all the classes and ${\bf O}$ represents all the observation sequences. Then the mutual information between the classes and the observations,

$$ I(V, {\bf O}) = H(V) - H(V \mid {\bf O}) \qquad (1.33) $$

becomes maximized, provided $H(V)$ is constant. This is the reason for calling it the Maximum Mutual Information (MMI) criterion. The other name of the method, Maximum A Posteriori (MAP), has its roots in eqn. 1.31, where the a posteriori probability $P(v \mid O, \Lambda)$ is maximized.
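As a purely numerical illustration of eqns. 1.32 and 1.33 (not part of the original text; the joint distribution below is made up), the following sketch computes $H(V)$, $H(V \mid {\bf O})$ and $I(V, {\bf O})$ for a small discrete class/observation table and checks the identity $I(V,{\bf O}) = H(V) - H(V \mid {\bf O})$. Since $H(V)$ depends only on the class priors and not on $\Lambda$, minimizing $H(V \mid {\bf O})$ indeed maximizes the mutual information.

\begin{verbatim}
import numpy as np

# Hypothetical joint distribution P(v, O): rows are 3 classes v,
# columns are 4 possible observation sequences O.
P_vO = np.array([[0.10, 0.05, 0.05, 0.05],
                 [0.05, 0.20, 0.05, 0.05],
                 [0.05, 0.05, 0.05, 0.25]])

P_v = P_vO.sum(axis=1)        # class marginal P(v)
P_O = P_vO.sum(axis=0)        # observation marginal P(O)
P_v_given_O = P_vO / P_O      # conditional P(v | O), column by column

H_V = -np.sum(P_v * np.log(P_v))                    # class entropy H(V)
H_V_given_O = -np.sum(P_vO * np.log(P_v_given_O))   # conditional entropy H(V|O)
I_VO = H_V - H_V_given_O                            # mutual information, eqn. 1.33

# Cross-check against the direct definition of mutual information.
I_direct = np.sum(P_vO * np.log(P_vO / np.outer(P_v, P_O)))
print(H_V, H_V_given_O, I_VO, I_direct)             # I_VO equals I_direct
\end{verbatim}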

Even though eqn. 1.31 defines the MMI criterion, it can be rearranged using Bayes' theorem to obtain better insight, as in eqn. 1.34.

$$ I(v \mid O, \Lambda) = -\log P(v \mid O, \Lambda) = -\log \frac{P(O \mid \lambda_v)\, P(v)}{\sum_{w} P(O \mid \lambda_w)\, P(w)} \qquad (1.34) $$

where $w$ represents an arbitrary class.
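In practice the likelihoods $P(O \mid \lambda_w)$ are extremely small, so eqn. 1.34 is usually evaluated in the log domain. The sketch below (an illustration, not part of the original text; the function name and all numbers are invented) computes the conditional information of the correct class from per-class log-likelihoods and log-priors using the log-sum-exp trick.

\begin{verbatim}
import numpy as np

def conditional_information(log_likelihoods, log_priors, correct):
    """-log P(v | O, Lambda) for the correct class v, as in eqn. 1.34,
    evaluated from log P(O | lambda_w) + log P(w) via log-sum-exp."""
    scores = log_likelihoods + log_priors          # log[ P(O|lambda_w) P(w) ] per class w
    log_denominator = np.logaddexp.reduce(scores)  # log sum_w P(O|lambda_w) P(w)
    return -(scores[correct] - log_denominator)

# Made-up values for a 3-class problem.
log_L = np.array([-120.3, -118.9, -125.4])         # log P(O | lambda_w)
log_P = np.log(np.array([0.3, 0.5, 0.2]))          # log P(w)
print(conditional_information(log_L, log_P, correct=1))
\end{verbatim}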

If we use a notation analogous to that of eqn. 1.9, we can write the likelihoods

$$ L^{clamped} = P(O \mid \lambda_v)\, P(v) \qquad (1.35) $$

$$ L^{free} = \sum_{w} P(O \mid \lambda_w)\, P(w) \qquad (1.36) $$

In the above equations the superscripts clamped and free denote the correct class and all the classes, respectively.

If we substitute eqns. 1.35 and 1.36 into eqn. 1.34, we get

$$ I(v \mid O, \Lambda) = -\log \frac{L^{clamped}}{L^{free}} = -\left( \log L^{clamped} - \log L^{free} \right) \qquad (1.37) $$

As in the case of ML, re-estimation [] or gradient methods can be used to minimize the quantity $I(v \mid O, \Lambda)$. In the following, a gradient-based method, which again makes use of eqn. 1.19, is described.

Since $I(v \mid O, \Lambda)$ is to be minimized, in this case

$$ J = I(v \mid O, \Lambda) $$

and therefore $J$ is directly given by eqn. 1.37. The problem then simplifies to the calculation of the gradients $\partial J / \partial \theta$, where $\theta$ is an arbitrary parameter of the whole set of HMMs $\Lambda$. This can be done by differentiating eqn. 1.37 wrt $\theta$,

$$ \frac{\partial J}{\partial \theta} = -\left[ \frac{1}{L^{clamped}} \frac{\partial L^{clamped}}{\partial \theta} - \frac{1}{L^{free}} \frac{\partial L^{free}}{\partial \theta} \right] \qquad (1.38) $$
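As a sketch of how eqn. 1.38 could drive a gradient-descent update, the fragment below performs one update step on a single parameter $\theta$. It assumes the update rule of eqn. 1.19 has the usual form $\theta^{new} = \theta^{old} - \eta\, \partial J / \partial \theta$; the function name, the likelihood values and their gradients are all made up for illustration.

\begin{verbatim}
def mmi_gradient(L_clamped, dL_clamped, L_free, dL_free):
    """dJ/dtheta for J = -log(L_clamped / L_free), as in eqn. 1.38."""
    return -(dL_clamped / L_clamped - dL_free / L_free)

# One gradient-descent step on a single parameter theta (made-up numbers).
eta = 0.01     # learning rate
theta = 0.42   # current value of the parameter theta
grad = mmi_gradient(L_clamped=1.2e-50, dL_clamped=3.0e-50,
                    L_free=4.8e-50, dL_free=6.0e-50)
theta_new = theta - eta * grad
\end{verbatim}

In a real implementation the ratios $\frac{1}{L}\,\frac{\partial L}{\partial \theta}$ would be computed with scaled forward-backward variables or in the log domain, since raw likelihoods underflow for utterances of realistic length.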

The same technique as in the case of ML can be used to compute the gradients of the likelihoods wrt the parameters. As a first step, the likelihoods of eqns. 1.35 and 1.36 are expressed in terms of the forward and backward variables, using the form of eqn. 1.7.

$$ L^{clamped} = P(v) \sum_{i} \alpha_i^{v}(t)\, \beta_i^{v}(t) \qquad (1.39) $$

$$ L^{free} = \sum_{w} P(w) \sum_{i} \alpha_i^{w}(t)\, \beta_i^{w}(t) \qquad (1.40) $$

where $\alpha_i^{w}(t)$ and $\beta_i^{w}(t)$ are the forward and backward variables of the model $\lambda_w$.

The required gradients can then be found by differentiating eqns. 1.39 and 1.40. As in the case of ML, we consider two cases: one for the transition probabilities and another for the observation probabilities.
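To illustrate eqns. 1.39 and 1.40 numerically, the sketch below implements the standard forward and backward recursions for a discrete-observation HMM (a generic textbook formulation, not copied from this document's equations; the two models, priors and observation sequence are made up) and combines the resulting variables into $L^{clamped}$, $L^{free}$ and the objective of eqn. 1.37.

\begin{verbatim}
import numpy as np

def forward(A, B, pi, obs):
    """alpha[t, i] = P(o_1 ... o_t, q_t = i | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    """beta[t, i] = P(o_{t+1} ... o_T | q_t = i, lambda)."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def likelihood(model, obs, t=0):
    """P(O | lambda) = sum_i alpha_i(t) beta_i(t), the same for any t."""
    A, B, pi = model
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    return np.sum(alpha[t] * beta[t])

# Two made-up 2-state, 2-symbol models (A, B, pi), one per class, with priors P(w).
models = [(np.array([[0.7, 0.3], [0.4, 0.6]]),
           np.array([[0.9, 0.1], [0.2, 0.8]]),
           np.array([0.6, 0.4])),
          (np.array([[0.5, 0.5], [0.1, 0.9]]),
           np.array([[0.3, 0.7], [0.6, 0.4]]),
           np.array([0.5, 0.5]))]
priors = [0.5, 0.5]
obs = [0, 1, 1, 0]
v = 0   # index of the correct class

L_clamped = priors[v] * likelihood(models[v], obs)                    # eqn. 1.39
L_free = sum(p * likelihood(m, obs) for p, m in zip(priors, models))  # eqn. 1.40
I = -np.log(L_clamped / L_free)   # objective of eqn. 1.37 for this utterance
print(L_clamped, L_free, I)
\end{verbatim}

Since $\sum_i \alpha_i(t)\,\beta_i(t)$ is independent of $t$, the sketch simply evaluates it at $t = 0$; any other choice of $t$ gives the same likelihoods.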




