We assume that the preprocessing part of the system outputs a sequence of observation vectors.

Starting from a certain set of initial values, the parameters of each of the HMMs
can be updated according to eqn. 1.19, with the required gradients given
by eqns. 1.44 and 1.48. However, for this particular
case, isolated recognition, the likelihoods in the last two equations
are calculated in a particular way.
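As a sketch of what such an update looks like in practice (eqn. 1.19 itself is not reproduced here; the step size `eta`, the clipping, and the row renormalization are illustrative assumptions, not the text's exact rule), a gradient-ascent step on a transition matrix could be written as:

```python
import numpy as np

def update_transitions(A, grad, eta=0.01):
    """One gradient-ascent step on a transition matrix, then
    renormalization so that every row stays a probability distribution."""
    A_new = np.maximum(A + eta * grad, 1e-12)   # step, then clip to stay positive
    return A_new / A_new.sum(axis=1, keepdims=True)
```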

First consider the *clamped* case. Since we have an HMM for each class
of units in isolated recognition, we can select the model
of the class *l* to which the current observation
sequence belongs. Then, starting from eqn. 1.39,

where the second line follows from eqn. 1.3.

Similarly for the *free* case, starting from eqn. 1.40,

where the resulting quantity represents the likelihood of the current observation
sequence belonging to class *l*, computed with the model of that class. With those
likelihoods defined in eqns. 1.52 and 1.53, the
gradient equations 1.44 and 1.48 take
the forms,
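The two likelihoods can be sketched as follows, assuming one discrete-emission HMM per class. The function names are illustrative, and the free-case likelihood is taken here as a plain sum over the class models (i.e. a uniform class prior is assumed, which the text does not state):

```python
import numpy as np

def forward_likelihood(pi, A, B_obs):
    """Forward algorithm: total likelihood of one observation sequence.

    pi    : (N,) initial state probabilities
    A     : (N, N) transition probabilities, A[i, j] = P(j | i)
    B_obs : (T, N) emission probabilities b_j(o_t), one row per frame
    """
    alpha = pi * B_obs[0]                 # alpha_1(j) = pi_j * b_j(o_1)
    for t in range(1, B_obs.shape[0]):
        alpha = (alpha @ A) * B_obs[t]    # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij b_j(o_t)
    return alpha.sum()

def clamped_and_free(models, B_all, l):
    """Clamped case: likelihood under the model of the true class l only.
    Free case: likelihood summed over the models of all classes.

    models : list of (pi, A), one per class
    B_all  : list of (T, N) emission tables, one per class, for the same sequence
    """
    per_class = [forward_likelihood(pi, A, B) for (pi, A), B in zip(models, B_all)]
    return per_class[l], sum(per_class)
```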

Now we can summarize the training procedure as follows.

**(1)**- Initialize each HMM,
with values generated randomly or using an initialization algorithm
like
*segmental K-means* []. **(2)**- Take an observation sequence and
- Calculate the forward and backward probabilities for each HMM, using the recursions 1.5 and 1.2.
- Using equations 1.52 and 1.53, calculate the likelihoods.
- Using equations 1.54 and 1.55, calculate the gradients with respect to the parameters of each model.
- Update the parameters of each of the models using eqn. 1.19.

**(3)**- Go to step (2) until all the observation sequences have been considered.
**(4)**- Repeat steps (2) to (3) until a convergence criterion is
satisfied.
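The whole procedure, steps (1)-(4), can be sketched as below. The analytic gradients of eqns. 1.54 and 1.55 are not reproduced here; this illustrative version approximates them by finite differences on the transition matrices only, and renormalizes rows after each step (both are assumptions for the sketch, not the text's exact rule):

```python
import numpy as np

def forward(pi, A, B_obs):
    """Forward recursion: total likelihood of one observation sequence."""
    alpha = pi * B_obs[0]
    for t in range(1, len(B_obs)):
        alpha = (alpha @ A) * B_obs[t]
    return alpha.sum()

def criterion(models, B_all, l):
    """Clamped log-likelihood minus free log-likelihood for one sequence."""
    p = [forward(pi, A, B) for (pi, A), B in zip(models, B_all)]
    return np.log(p[l]) - np.log(sum(p))

def train_epoch(models, data, eta=0.05, eps=1e-5):
    """One pass over the data, i.e. steps (2)-(3) above.

    models : list of (pi, A), one per class
    data   : list of (B_all, l) pairs; B_all holds per-class emission tables
             for one sequence, l is the true class index
    """
    for B_all, l in data:
        base = criterion(models, B_all, l)
        for k, (pi, A) in enumerate(models):
            G = np.zeros_like(A)
            for i in range(A.shape[0]):          # numerical gradient wrt A[i, j]
                for j in range(A.shape[1]):
                    Ap = A.copy(); Ap[i, j] += eps
                    mp = list(models); mp[k] = (pi, Ap)
                    G[i, j] = (criterion(mp, B_all, l) - base) / eps
            A_new = np.maximum(A + eta * G, 1e-12)   # gradient step, keep positive
            models[k] = (pi, A_new / A_new.sum(axis=1, keepdims=True))
    return models
```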

This procedure can easily be modified if continuous density HMMs are used, by propagating the gradients via the chain rule to the parameters of the continuous probability distributions. Further, it is worth mentioning that the preprocessors can also be trained simultaneously, by back-propagating the gradients one step further.
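For example, with Gaussian emission densities the chain rule only needs the derivative of the log-density with respect to its parameters; for the mean this is the standard identity (o - mu)/var, shown here for the scalar case (the function name is illustrative):

```python
import math

def dlogb_dmu(o, mu, var):
    """d/dmu of log N(o; mu, var) = (o - mu) / var.
    Chaining this with the gradient wrt log b_j(o_t) extends the
    discrete-emission gradients to Gaussian emission densities."""
    return (o - mu) / var

def gaussian_logpdf(o, mu, var):
    """log N(o; mu, var), used below to check the derivative numerically."""
    return -0.5 * math.log(2 * math.pi * var) - (o - mu) ** 2 / (2 * var)
```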
