In this project, a hybrid ANN-HMM speech recognizer with NN-based
adaptive pre-processing was studied, with an emphasis on the
pre-processing part. The feasibility of such a system was demonstrated
first by formulating the classical pre-processing based on mel-scale
cepstral coefficients as a neural network, and then by optimizing the
whole system as a single unit under the MMI criterion. Twelve experiments,
involving a speaker-independent, isolated-phoneme recognition task
with five broad classes, were carried out with various modifications
introduced to the pre-processing part. The degree of adaptivity of the
pre-processing in those experiments was varied over a wide range,
from complete non-adaptivity to full adaptivity with different
structures.
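
To make the idea of "classical pre-processing as a neural network"
concrete, the following is a minimal sketch of the textbook mel-cepstrum
pipeline written as a stack of layers: a linear DFT layer, a squared
magnitude, a linear mel filterbank layer, a log nonlinearity, and a
linear DCT layer. It is an illustration only, not the network used in
this project. All function names, the frame length, the sample rate, and
the filter and coefficient counts are assumptions chosen for the example.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def dft_weights(n_fft):
    """Fixed complex weights of the DFT 'layer' (bins 0 .. n_fft/2)."""
    return [[cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft)]
            for k in range(n_fft // 2 + 1)]


def mel_weights(n_filters, n_fft, sample_rate):
    """Triangular mel filterbank as a weight matrix (n_filters x n_bins)."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge frequencies, equally spaced on the mel scale
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    weights = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        weights.append([max(0.0, min((f - lo) / (mid - lo),
                                     (hi - f) / (hi - mid)))
                        for f in bin_hz])
    return weights


def dct_weights(n_ceps, n_filters):
    """DCT-II weights mapping log filterbank energies to cepstra."""
    return [[math.cos(math.pi * q * (m + 0.5) / n_filters)
             for m in range(n_filters)] for q in range(n_ceps)]


def matvec(w, x):
    """Plain matrix-vector product: one linear layer without bias."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def mfcc_forward(frame, w_dft, w_mel, w_dct, eps=1e-10):
    """Forward pass: linear -> |.|^2 -> linear -> log -> linear."""
    power = [abs(z) ** 2 for z in matvec(w_dft, frame)]
    energies = matvec(w_mel, power)
    log_energies = [math.log(e + eps) for e in energies]
    return matvec(w_dct, log_energies)


# Example: one 64-sample frame of a 1 kHz tone sampled at 8 kHz (assumed values).
n_fft, sample_rate = 64, 8000
frame = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(n_fft)]
ceps = mfcc_forward(frame, dft_weights(n_fft),
                    mel_weights(8, n_fft, sample_rate), dct_weights(5, 8))
```

With the three weight matrices held fixed, this corresponds to the
non-adaptive end of the range. Treating some or all of them as trainable
parameters is one way to picture the intermediate and fully adaptive
variants.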
It was shown that full adaptivity generalizes poorly, while
non-adaptivity has inferior learning ability. It was demonstrated that
generalization can be improved by reducing the number of free
parameters, and several parameter-reduction methods were tried.
For pre-processing, both MLP and recurrent structures were tried. A
structure with a layer of recurrent neurons operating at the front end
of the system was shown to give the best recognition performance,
and a mathematical treatment was given to show that such a recurrent
layer actually performs a short-time Fourier transform that is optimal
in the sense of recognition error.
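
The link between a recurrent neuron and the short-time Fourier transform
can be illustrated with a small numerical check. The sketch below
assumes a linear, complex-valued, first-order recurrence
y[n] = a*y[n-1] + x[n] with a = r*exp(j*omega); the exact neuron model
used in this project is not stated here, so this form, the function
names, and the parameter values are all assumptions. Under that
assumption, y[n] equals the short-time Fourier transform at frequency
omega with a one-sided exponential window w[k] = r**k, up to the phase
factor exp(j*omega*n).

```python
import cmath
import math


def recurrent_neuron(x, omega, r):
    """Linear complex recurrence y[n] = a*y[n-1] + x[n], a = r*exp(j*omega)."""
    a = r * cmath.exp(1j * omega)
    y, out = 0j, []
    for sample in x:
        y = a * y + sample
        out.append(y)
    return out


def stft_exp_window(x, omega, r, n):
    """STFT at time n and frequency omega with window w[k] = r**k, k >= 0."""
    return sum(x[m] * r ** (n - m) * cmath.exp(-1j * omega * m)
               for m in range(n + 1))


# Example signal and parameters (assumed values, for illustration only).
omega, r = 0.3 * math.pi, 0.9
x = [math.cos(0.3 * math.pi * n) + 0.5 * math.cos(0.7 * math.pi * n)
     for n in range(50)]
y = recurrent_neuron(x, omega, r)

# The two quantities agree after removing the phase factor exp(j*omega*n).
max_diff = max(abs(y[n] * cmath.exp(-1j * omega * n)
                   - stft_exp_window(x, omega, r, n))
               for n in range(len(x)))
```

In this picture, omega sets the analysis frequency and r sets the
effective window length. Training these two parameters against the
recognition criterion is one way to read "optimal in the sense of
recognition error"; here they are simply fixed by hand.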
Finally, a comparative evaluation of all the modified versions of the
system was given to illustrate the superiority and irreplaceability of
adaptive pre-processing.