12th Annual Conference of the
International Speech Communication Association
Interspeech 2011 Florence
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself. If you have signed in to My Schedule, you can add papers to your own personalised list.
Sun-Ses2-P3: ASR - Feature Extraction I
Time: Sunday 13:30
Place: Faenza 1 - Pala Congressi (Passi Perduti-Gallery)
Type: Poster
Chair: Fabio Brugnara
#1 | Integrating recent MLP feature extraction techniques into TRAP architecture
Frantisek Grezl (Brno University of Technology), Martin Karafiat (Brno University of Technology)
This paper focuses on incorporating recent techniques for multi-layer perceptron (MLP) based feature extraction into the Temporal Pattern (TRAP) and Hidden Activation TRAP (HATS) feature extraction schemes. The TRAP scheme has been the origin of various MLP-based features, some of which are now an integral part of state-of-the-art LVCSR systems. The modifications which brought the most improvement -- sub-phoneme targets and the Bottle-Neck technique -- are introduced into the original TRAP scheme. The introduction of sub-phoneme targets uncovered the hidden danger of having too many classes in the TRAP/HATS scheme. On the other hand, the Bottle-Neck technique improved the TRAP/HATS scheme so that it is competitive with other approaches.
#2 | Feature Frame Stacking in RNN-based Tandem ASR Systems - Learned vs. Predefined Context
Martin Woellmer (Technische Universitaet Muenchen), Bjoern Schuller (Technische Universitaet Muenchen), Gerhard Rigoll (Technische Universitaet Muenchen)
As phoneme recognition is known to profit from techniques that consider contextual information, neural networks applied in Tandem automatic speech recognition (ASR) systems usually employ some form of context modeling. While approaches based on multi-layer perceptrons or recurrent neural networks (RNN) are able to model a predefined amount of context by simultaneously processing a stacked sequence of successive feature vectors, bidirectional Long Short-Term Memory (BLSTM) networks were shown to be well-suited for incorporating a self-learned amount of context for phoneme prediction. In this paper, we evaluate combinations of BLSTM modeling and frame stacking to determine the most efficient method for exploiting context in RNN-based Tandem systems. Using the COSINE corpus and our recently introduced multi-stream BLSTM-HMM decoder, we provide empirical evidence for the intuition that BLSTM networks make frame stacking redundant, while RNNs profit from predefined feature-level context.
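As a rough illustration of the predefined feature-level context discussed above, the following sketch stacks each feature frame with a fixed number of neighbouring frames; the window size and feature dimension are illustrative assumptions, not values from the paper:

```python
import numpy as np

def stack_frames(features, left=4, right=4):
    """Stack each frame with `left` preceding and `right` following frames.

    features: (num_frames, dim) array of per-frame feature vectors.
    Returns an array of shape (num_frames, dim * (left + right + 1)).
    Edge frames are padded by repeating the first/last frame.
    """
    padded = np.concatenate([
        np.repeat(features[:1], left, axis=0),
        features,
        np.repeat(features[-1:], right, axis=0),
    ])
    return np.concatenate(
        [padded[i:i + len(features)] for i in range(left + right + 1)],
        axis=1,
    )

# Example: 100 frames of 39-dimensional features -> 100 x 351 stacked vectors
x = np.random.randn(100, 39)
print(stack_frames(x).shape)  # (100, 351)
```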
#3 | Improved Acoustic Feature Combination for LVCSR by Neural Networks
Christian Plahl (RWTH Aachen University), Ralf Schlüter (RWTH Aachen University), Hermann Ney (RWTH Aachen University)
This paper investigates the combination of different acoustic features. Several methods to combine these features, such as concatenation or LDA, are well known. Even though LDA improves the system, feature combination by LDA has been shown to be suboptimal. We introduce a new method based on neural networks. The posterior estimates derived from the NN lead to a significant improvement and achieve a 6% relative improvement in word error rate (WER). Results are also compared to system combination. While system combination has been reported to outperform all other combination techniques, in this work the proposed NN-based combination outperforms system combination. We achieve a 2% relative WER improvement over system combination, resulting in an improvement of 7% relative over the baseline system. In addition to giving better recognition performance w.r.t. WER, NN-based combination reduces both training and testing complexity. Overall, we use a single set of acoustic models, together with the training of the NN.
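For readers unfamiliar with the combination schemes being compared, the sketch below contrasts plain concatenation, LDA-based combination, and NN posterior features on toy data; it is a schematic illustration using scikit-learn with made-up dimensions and labels, not the authors' system:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier

# Toy stand-ins for two acoustic feature streams (e.g. MFCC and PLP) with
# frame-level state labels; a real system would use forced alignments.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(2000, 13))
plp = rng.normal(size=(2000, 13))
labels = rng.integers(0, 10, size=2000)

combined = np.hstack([mfcc, plp])            # plain concatenation

# Combination by LDA: project the concatenated stream to a lower dimension.
lda = LinearDiscriminantAnalysis(n_components=9).fit(combined, labels)
lda_feats = lda.transform(combined)

# NN-based combination: frame posteriors from an MLP over both streams,
# typically log-transformed (and decorrelated) before use as features.
mlp = MLPClassifier(hidden_layer_sizes=(64,)).fit(combined, labels)
posterior_feats = np.log(mlp.predict_proba(combined) + 1e-10)
print(lda_feats.shape, posterior_feats.shape)
```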
#4 | Hierarchical Tandem Features for ASR in Mandarin
Joel Pinto (Idiap Research Institute), Mathew Magimai.-Doss (Idiap Research Institute), Herve Bourlard (Idiap Research Institute)
We apply multilayer perceptron (MLP) based hierarchical Tandem features to large vocabulary continuous speech recognition in Mandarin. Hierarchical Tandem features are estimated using a cascade of two MLP classifiers which are trained independently. The first classifier is trained on perceptual linear predictive coefficients with a 90 ms temporal context. The second classifier is trained using the phonetic class conditional probabilities estimated by the first MLP, but with a relatively longer temporal context of about 150 ms. Experiments on the Mandarin DARPA GALE eval06 dataset show significant reduction (about 7.6% relative) in character error rates by using hierarchical Tandem features over conventional Tandem features.
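A minimal sketch of the hierarchical Tandem idea, two independently trained MLPs where the second operates on a longer window of the first one's posteriors, is given below; all dimensions, context sizes, and the scikit-learn models are illustrative assumptions rather than the configuration used in the paper:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def stack(feats, left, right):
    """Concatenate each frame with its left/right neighbours (edge-padded)."""
    pad = np.concatenate([np.repeat(feats[:1], left, 0), feats,
                          np.repeat(feats[-1:], right, 0)])
    return np.hstack([pad[i:i + len(feats)] for i in range(left + right + 1)])

rng = np.random.default_rng(1)
plp = rng.normal(size=(3000, 39))            # toy PLP features
phones = rng.integers(0, 40, size=3000)      # toy frame-level phone labels

# Stage 1: classifier over a short temporal context of spectral features.
mlp1 = MLPClassifier(hidden_layer_sizes=(128,), max_iter=30)
mlp1.fit(stack(plp, 4, 4), phones)
post1 = mlp1.predict_proba(stack(plp, 4, 4))

# Stage 2: classifier over a longer context of stage-1 phone posteriors;
# its (log) outputs serve as the hierarchical Tandem features.
mlp2 = MLPClassifier(hidden_layer_sizes=(128,), max_iter=30)
mlp2.fit(stack(post1, 7, 7), phones)
tandem = np.log(mlp2.predict_proba(stack(post1, 7, 7)) + 1e-10)
```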
#5 | Analysis and Comparison of Recent MLP Features for LVCSR Systems
Fabio Valente (Idiap Research Institute), Mathew Magimai Doss (Idiap Research Institute), Wen Wang (SRI International)
MLP-based front-ends have evolved in different ways in recent years, beyond the seminal TANDEM-PLP features. This paper aims at providing a fair comparison of these recent advances, including the use of different long/short temporal inputs and the use of complex architectures (bottleneck, hierarchy, multistream) that go beyond the conventional three-layer MLP. Furthermore, the paper identifies which of these actually provide advantages over conventional TANDEM-PLP. The investigation is carried out on an LVCSR task for recognition of Mandarin broadcast speech, and results are analyzed in terms of Character Error Rate and phonetic confusions. Results reveal that, as stand-alone features, multistream front-ends can outperform conventional spectral features like MFCC by 10%, while TANDEM-PLP only improve by 1%. When used in concatenation with MFCC features, hierarchical/bottleneck front-ends reduce the character error rate by +18% relative compared to +14% relative from TANDEM-PLP.
#6 | Deep Learning of Speech Features for Improved Phonetic Recognition
Jaehyung Lee (KAIST), Soo-Young Lee (KAIST)
Recently, a remarkable performance result of 23.0% Phone Error Rate (PER) on the TIMIT core test set was reported by applying Deep Belief Network (DBN) on phonetic recognition [1]. Despite the good performance reported, there is still substantial room for improvement in the reported design in order to achieve optimal results. In this letter, we present an improved but simple architecture for phonetic recognition which uses the log-Mel spectrum directly instead of Mel-Frequency Cepstral Coefficients (MFCC), and combines Deep Learning with conventional Baum-Welch re-estimation for subphoneme alignment. Experiments performed on the TIMIT speech corpus show that the proposed method outperforms most of the conventional methods, yielding 21.4% PER on the complete test set of TIMIT and 22.1% on the core test set.
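As a point of reference, the sketch below computes the kind of log-Mel spectrum input mentioned in the abstract and, for comparison, the MFCCs derived from it; it assumes librosa is available and uses illustrative parameters rather than the paper's configuration:

```python
import numpy as np
import librosa

# Illustrative parameters: 25 ms windows, 10 ms hop, 40 Mel bands at 16 kHz.
y, sr = librosa.load(librosa.ex('trumpet'), sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160,
                                     n_mels=40)
log_mel = np.log(mel + 1e-10)        # log-Mel features fed to the deep network

# For comparison, MFCCs are essentially the DCT of these log energies.
mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)
print(log_mel.shape, mfcc.shape)
```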
#7 | Globality-Locality Consistent Discriminant Analysis for Phone Classification
Heyun Huang (Department of Linguistics, Radboud University Nijmegen, Erasmuslaan 1, 6525 HT, Nijmegen, the Netherlands), Yang Liu (Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong), Jort Gemmeke (Department of Linguistics, Radboud University Nijmegen, Erasmuslaan 1, 6525 HT, Nijmegen, the Netherlands), Louis ten Bosch (Department of Linguistics, Radboud University Nijmegen, Erasmuslaan 1, 6525 HT, Nijmegen, the Netherlands), Bert Cranen (Department of Linguistics, Radboud University Nijmegen, Erasmuslaan 1, 6525 HT, Nijmegen, the Netherlands), Lou Boves (Department of Linguistics, Radboud University Nijmegen, Erasmuslaan 1, 6525 HT, Nijmegen, the Netherlands)
Concatenating sequences of feature vectors helps to capture essential information about articulatory dynamics, at the cost of increasing the number of dimensions in the feature space, which may be characterized by the presence of manifolds. Existing supervised dimensionality reduction methods such as Linear Discriminant Analysis may destroy part of that manifold structure. In this paper, we propose a novel supervised dimensionality reduction algorithm, called Globality-Locality Consistent Discriminant Analysis (GLCDA), which aims to preserve global and local discriminant information simultaneously. Because it allows finding the optimal trade-off between global and local structure of data sets, GLCDA can provide a more faithful compact representation of high-dimensional observations than entirely global approaches or heuristic approaches aimed at preserving local information. Experimental results on the TIMIT phone classification task show the effectiveness of the proposed algorithm.
#8 | Front-End Compensation Methods for LVCSR Under Lombard Effect
Hynek Boril (Center for Robust Speech Systems (CRSS), The University of Texas at Dallas), Frantisek Grezl (Speech@FIT, Brno University of Technology), John H.L. Hansen (Center for Robust Speech Systems (CRSS), The University of Texas at Dallas)
This study analyzes the impact of noisy background variations and Lombard effect (LE) on large vocabulary continuous speech recognition (LVCSR). Robustness of several front-end feature extraction strategies combined with state-of-the-art feature distribution normalizations is tested on neutral and Lombard speech from the UT-Scope database presented in two types of background noise at various levels of SNR. An extension of a bottleneck (BN) front-end utilizing normalization of both critical band energies (CRBE) and BN outputs is proposed and shown to provide a competitive performance compared to the best MFCC-based system. A novel MFCC-based BN front-end is introduced and shown to outperform all other systems in all conditions considered (average 4.1% absolute WER reduction over the second best system). Additionally, two phenomena are observed: (i) combination of cepstral mean subtraction and recently established RASTALP filtering significantly reduces transient effects of RASTA band-pass filtering and increases ASR robustness to noise and LE; (ii) histogram equalization may benefit from utilizing reference distributions derived from pre-normalized rather than raw training features, and also from adopting distributions from different front-ends.
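For orientation, the following sketch shows two of the generic normalization steps mentioned above, cepstral mean subtraction and quantile-based histogram equalization towards a reference distribution; it is a simplified stand-in on synthetic data, not the exact pipeline evaluated in the paper:

```python
import numpy as np

def cms(feats):
    """Cepstral mean subtraction: remove the per-utterance mean of each dim."""
    return feats - feats.mean(axis=0, keepdims=True)

def histogram_equalize(feats, reference):
    """Map each feature dimension onto the empirical distribution of
    `reference` (e.g. pre-normalized training features) via quantiles."""
    out = np.empty_like(feats)
    q = np.linspace(0, 100, 1001)
    for d in range(feats.shape[1]):
        src_q = np.percentile(feats[:, d], q)
        ref_q = np.percentile(reference[:, d], q)
        out[:, d] = np.interp(feats[:, d], src_q, ref_q)
    return out

rng = np.random.default_rng(2)
train = rng.normal(size=(5000, 13))
test = 0.5 * rng.normal(size=(400, 13)) + 1.0   # shifted/scaled "noisy" data
equalized = histogram_equalize(cms(test), cms(train))
```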
#9 | Classification of Fricatives Using Feature Extrapolation of Acoustic-Phonetic Features in Telephone Speech
Jung-Won Lee (School of Electrical and Electronic Engineering, Yonsei University), Jeung-Yoon Choi (School of Electrical and Electronic Engineering, Yonsei University), Hong-Goo Kang (School of Electrical and Electronic Engineering, Yonsei University)
This paper proposes a classification module for fricative consonants in telephone speech using an acoustic-phonetic feature extrapolation technique. In channel-deteriorated telephone speech, acoustic cues of fricative consonants are expected to be degraded or missing due to the limited bandwidth. This paper applies an extrapolation technique to acoustic-phonetic features based on Gaussian mixture models, which uses statistical learning of the correspondence between the acoustic-phonetic features of wideband speech and the spectral characteristics of telephone-bandwidth speech. Experimental results with the NTIMIT database verify that the feature extrapolation improves the performance of the fricative classification module for all unvoiced fricatives by 0.5-5% (relative) compared to the performance obtained using only acoustic-phonetic features extracted from the narrowband signal.
#10 | Noise Robust Feature Extraction Based on Extended Weighted Linear Prediction in LVCSR
Sami Keronen (Aalto University School of Science and Technology), Jouni Pohjalainen (Aalto University School of Science and Technology), Paavo Alku (Aalto University School of Science and Technology), Mikko Kurimo (Aalto University School of Science and Technology)
This paper introduces extended weighted linear prediction (XLP) for noise-robust short-time spectrum analysis in the feature extraction process of a speech recognition system. XLP is a generalization of standard linear prediction (LP) and temporally weighted linear prediction (WLP), which have already been applied to noise-robust speech recognition with good results. With XLP, greater control over the temporal weighting of different parts of the noisy speech is gained by taking the lags of the signal into account in prediction. Here, the performance of XLP is compared against WLP and the conventional spectrum analysis methods FFT and LP on a large vocabulary continuous speech recognition (LVCSR) task using real-world noisy data containing additive and convolutive noise. The results show improvements over the reference methods in several cases.
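The sketch below illustrates plain temporally weighted linear prediction (WLP), the baseline that XLP generalizes, using short-time-energy weighting; the weighting choice and model order are illustrative assumptions, and the paper's XLP itself is not reproduced here:

```python
import numpy as np

def wlp(x, order=12):
    """Temporally weighted LP: minimize sum_t w_t (x_t - sum_k a_k x_{t-k})^2,
    with w_t taken here as the energy of the `order` previous samples."""
    n = len(x)
    R = np.zeros((order, order))
    r = np.zeros(order)
    for t in range(order, n):
        past = x[t - order:t][::-1]            # x[t-1], ..., x[t-order]
        w = np.sum(past ** 2) + 1e-8           # weight for this prediction instant
        R += w * np.outer(past, past)          # weighted normal equations R a = r
        r += w * x[t] * past
    return np.linalg.solve(R, r)               # predictor coefficients a_1..a_p

# Example on a synthetic noisy frame
rng = np.random.default_rng(3)
frame = np.sin(0.3 * np.arange(400)) + 0.1 * rng.normal(size=400)
print(wlp(frame, order=12))
```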
#11 | Comparing Different Flavors of Spectro-Temporal Features for ASR
Bernd T. Meyer (International Computer Science Institute, Berkeley, CA, USA), Suman V. Ravuri (International Computer Science Institute, Berkeley, CA, USA), Marc René Schädler (Medical Physics, Institute of Physics, University of Oldenburg, Germany), Nelson Morgan (International Computer Science Institute, Berkeley, CA, USA)
In the last decade, several studies have shown that the robustness of ASR systems can be increased when 2D Gabor filters are used to extract specific modulation frequencies from the input pattern. This paper analyzes important design parameters for spectro-temporal features based on a Gabor filter bank: We perform experiments with filters that exhibit different phase sensitivity. Further, we analyze if non-linear weighting with a multi-layer perceptron (MLP) and a subsequent concatenation with mel frequency cepstral coefficients (MFCCs) has beneficial effects. For the Aurora2 noisy digit recognition task, the use of phase sensitive filters improved the MFCC baseline, whereas using filters that neglect phase information did not. While MLP processing alone did not have a large effect on the overall scores, the best results were obtained for MLP-processed phase sensitive filters and added MFCCs, with relative error reductions of over 40% for both noisy and clean training.
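As an illustration of the kind of 2D Gabor filtering described above, the sketch below builds one complex spectro-temporal Gabor filter and applies it to a toy log-Mel spectrogram, producing a phase-sensitive (real-part) and a phase-insensitive (magnitude) response; all filter parameters and sizes are illustrative assumptions:

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_2d(omega_s, omega_t, size=(9, 25)):
    """Complex 2D Gabor filter tuned to spectral modulation omega_s
    (cycles/channel) and temporal modulation omega_t (cycles/frame)."""
    ks, kt = size
    s = np.arange(ks) - ks // 2
    t = np.arange(kt) - kt // 2
    S, T = np.meshgrid(s, t, indexing='ij')
    envelope = np.exp(-(S ** 2) / (2 * (ks / 4) ** 2)
                      - (T ** 2) / (2 * (kt / 4) ** 2))
    carrier = np.exp(2j * np.pi * (omega_s * S + omega_t * T))
    return envelope * carrier

# Toy log-Mel spectrogram: 40 channels x 300 frames.
rng = np.random.default_rng(4)
log_mel = rng.normal(size=(40, 300))

g = gabor_2d(omega_s=0.1, omega_t=0.04)
real_resp = convolve2d(log_mel, g.real, mode='same')   # phase-sensitive output
imag_resp = convolve2d(log_mel, g.imag, mode='same')
mag_resp = np.hypot(real_resp, imag_resp)              # phase-insensitive output
```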
#12 | VTLN in the MFCC domain: band-limited versus local interpolation
Ehsan Variani (Johns Hopkins University), Thomas Schaaf (Multimodal Technologies, Inc.)
We propose a new easy-to-implement method to compute a Linear Transform (LT) to perform Vocal Tract Length Normalization (VTLN) on truncated Mel Frequency Cepstral Coefficients (MFCCs) normally used in distributed speech recognition. The method is based on a Local Interpolation which is independent of the Mel filter design. Local Interpolation (LILT) VTLN is theoretically and experimentally compared to a global scheme based on band-limited interpolation (BLI-VTLN) and the conventional frequency warping scheme (FFT-VTLN). Investigating the interoperability of these methods shows that the performance of LILT-VTLN is on par with FFT-VTLN and BLI-VTLN. The statistical significance test also shows that there are no significant differences between FFT-VTLN, LILT-VTLN, and BLI-VTLN, even if the models and front ends do not match.
#13 | Multistream Bandpass Modulation Features for Robust Speech Recognition
Sridhar Krishna Nemala (Johns Hopkins University), Kailash Patil (Johns Hopkins University), Mounya Elhilali (Johns Hopkins University)
Current understanding of speech processing in the brain suggests dual streams of processing of temporal and spectral information, whereby slow vs. fast modulations are analyzed along parallel paths that encode various scales of information in speech signals. In this work, we propose a multistream approach to feature analysis for robust speaker-independent phoneme recognition. The scheme presented here centers around a multi-path bandpass modulation analysis of speech sounds with each stream covering an entire range of temporal and spectral modulations. By performing bandpass operations of slow vs. fast information along the spectral and temporal dimensions, the proposed scheme avoids the classic feature explosion problem of previous multistream approaches while maintaining the advantage of parallelism and localized feature analysis. The proposed architecture results in substantial improvements over standard baseline features and two state-of-the-art noise robust feature schemes.
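A minimal sketch of splitting log-Mel features into slow and fast temporal-modulation streams with bandpass filters is given below; the band edges and the Butterworth filter design are illustrative assumptions, not the filter bank used in the paper:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def temporal_modulation_streams(log_mel, frame_rate=100.0,
                                bands=((0.5, 4.0), (4.0, 16.0))):
    """Split log-Mel features into streams of slow vs. fast temporal
    modulations by bandpass filtering each channel along the time axis."""
    streams = []
    for low, high in bands:
        b, a = butter(2, [low / (frame_rate / 2), high / (frame_rate / 2)],
                      btype='band')
        streams.append(filtfilt(b, a, log_mel, axis=1))
    return streams

rng = np.random.default_rng(5)
log_mel = rng.normal(size=(40, 500))       # 40 channels x 5 s at 100 frames/s
slow, fast = temporal_modulation_streams(log_mel)
print(slow.shape, fast.shape)
```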
#14 | An Analysis of Automatic Speech Recognition with Multiple Microphones
Davide Marino (University of Sheffield), Thomas Hain (University of Sheffield)
Automatic speech recognition in real world situations often requires the use of microphones distant from the speaker’s mouth. One or several microphones are placed in the surroundings to capture many versions of the original signal. Recognition with a single far-field microphone yields considerably poorer performance than with person-mounted devices (headset, lapel), the main causes being reverberation and noise. Acoustic beamforming techniques allow significant improvements over the use of a single microphone, although the overall performance still falls well short of the close-talking results. In this paper we investigate the use of beamforming in the context of speaker movement, together with commonly used adaptation techniques, and compare against a naive multi-stream approach. We show that even such a simple approach can yield equivalent results to beamforming, allowing for far more powerful integration of multiple microphone sources in ASR systems.
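For context, the sketch below implements a basic delay-and-sum beamformer that aligns each microphone channel to a reference channel via cross-correlation before averaging; it is a textbook illustration on synthetic signals, not the microphone-array setup evaluated in the paper:

```python
import numpy as np

def delay_and_sum(channels, max_lag=400):
    """Align each microphone channel to the first one via cross-correlation
    and average the aligned signals (simple delay-and-sum beamformer)."""
    ref = channels[0]
    aligned = [ref]
    for ch in channels[1:]:
        corr = np.correlate(ch, ref, mode='full')
        centre = len(ch) - 1                      # index of zero lag
        window = corr[centre - max_lag:centre + max_lag + 1]
        lag = np.argmax(window) - max_lag         # ch arrives `lag` samples late
        # np.roll wraps around; for long signals the wrapped edge is negligible.
        aligned.append(np.roll(ch, -lag))
    return np.mean(aligned, axis=0)

# Example: the same source received on three microphones with different delays.
rng = np.random.default_rng(6)
src = rng.normal(size=16000)
mics = [np.roll(src, d) + 0.3 * rng.normal(size=16000) for d in (0, 12, 31)]
enhanced = delay_and_sum(mics)
```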