The idea of a wider window has been described in section in
relation with eqns. and . In this case we use input speech segments
which are, for example, 3 times longer than the usual frames.
The front-end neural network, which performs the windowing and the
Hartley (or Fourier) transform, then generates 3 vectors, which are
fed to the filter bank as if they were consecutive vectors. The idea is
that each of these three vectors has been generated within the context of a longer
time interval. This is in principle the same procedure as the
contextual information extraction in [, ]. The two methods differ,
however, in the point at which the contextual information is
extracted. In the suggested method it is done at the very beginning,
while in the other method it is done after pre-processing. Contextual
information extraction at an early point has the advantage that we can
make use of all the information available from the original speech
signal. At a later point, however, we have only a signal which has
lost part of its information due to the reduction of dimensionality. Therefore
better results can be expected from the suggested method. The drawback
of this approach is, however, the network size, which may be very large.
Nevertheless, a system which accepts a window 3 times longer than a
400-sample frame can be implemented on an ordinary SUN workstation.
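
As a rough illustration of the wide-window front end, the following
Python sketch maps one 1200-sample segment to 3 vectors of 400 Hartley
coefficients each. It is only a sketch under stated assumptions: the
text does not specify how the network derives the three vectors, so the
Hamming window, the splitting of the windowed segment into consecutive
sub-frames, and all function names are hypothetical. The front end is
written here as plain array operations rather than as a network; since
windowing and the transform are linear, they could equally be expressed
as fixed weights of a linear layer. The weight count assumes a fully
connected layer without biases.

```python
import numpy as np

FRAME_LEN = 400      # usual frame length in samples (from the text)
CONTEXT_FACTOR = 3   # the wide window is 3 times longer than the usual frame


def dht(x):
    """Discrete Hartley transform along the last axis, computed via the FFT."""
    spectrum = np.fft.fft(x, axis=-1)
    # The cas kernel is cos + sin, which equals Re(F) - Im(F)
    # for NumPy's FFT sign convention.
    return spectrum.real - spectrum.imag


def wide_window_front_end(segment):
    """Map one long segment to CONTEXT_FACTOR vectors of FRAME_LEN coefficients.

    Assumption (not from the text): the window is applied over the whole
    long segment, which is then cut into consecutive sub-frames that are
    transformed separately.
    """
    segment = np.asarray(segment, dtype=float)
    expected = FRAME_LEN * CONTEXT_FACTOR
    if segment.shape != (expected,):
        raise ValueError(f"expected a segment of {expected} samples")
    windowed = segment * np.hamming(expected)            # window over the full context
    sub_frames = windowed.reshape(CONTEXT_FACTOR, FRAME_LEN)
    return dht(sub_frames)                               # one vector per sub-frame


def dense_weight_count(n_inputs, n_outputs):
    """Weights of a fully connected layer without biases (assumed layer type)."""
    return n_inputs * n_outputs
```

Applied to an array of 1200 samples, `wide_window_front_end` returns an
array of shape (3, 400), whose rows can be passed to the filter bank as
if they were consecutive vectors. Under the fully connected assumption,
`dense_weight_count(1200, 1200)` gives 1,440,000 weights against
160,000 for a single 400-sample frame, i.e. 9 times as many, which
indicates how quickly the network size grows with the window length.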