Corporate & Society Sponsors
Loquendo - diamond package
Nuance - gold package
AT&T - bronze package
Google - silver package
Appen - bronze package
Interactive Media - bronze package
Microsoft - bronze package
SpeechOcean - bronze package
Avios - logo package
NDI - logo package

CNR-ISTC
Université d'Avignon
SpeechCycle
AT&T
Università di Firenze
FUB
FBK
Univ. Trento
Univ. Napoli
Univ. Tuscia
Univ. Calabria
Univ. Venezia

AISV

Comune di Firenze
Firenze Fiera
Florence Convention Bureau

ISCA

12th Annual Conference of the International Speech Communication Association

Sponsors

Interspeech 2011 Florence

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left shows the current presentation order, but this may still change, so please check at the conference itself.

Mon-Ses1-P3:
Robust Speech Recognition I

Time: Monday 10:00  Place: Faenza 1 - Pala Congressi (Passi Perduti-Gallery)  Type: Poster
Chair: Pietro Laface

#1 A versatile Gaussian splitting approach to non-linear state estimation and its application to noise-robust ASR

Volker Leutnant (Department of Communications Engineering, University of Paderborn)
Alexander Krueger (Department of Communications Engineering, University of Paderborn)
Reinhold Haeb-Umbach (Department of Communications Engineering, University of Paderborn)

In this work, a splitting and weighting scheme that allows a Gaussian density to be split into a Gaussian mixture density (GMM) is extended so that the mixture components can be arranged along arbitrary directions. The parameters of the Gaussian mixture are chosen such that the GMM and the original Gaussian still exhibit equal central moments up to order four. The resulting mixture components' covariances have eigenvalues that are smaller than those of the covariance of the original distribution, which is a desirable property in the context of non-linear state estimation, since the underlying assumptions of the extended Kalman filter are better justified in this case. Application to speech feature enhancement in the context of noise-robust automatic speech recognition reveals the beneficial properties of the proposed approach in terms of a reduced word error rate on the Aurora 2 recognition task.

#2 Generalized-Log Spectral Mean Normalization for Speech Recognition

Hilman Ferdinandus Pardede (Tokyo Institute of Technology)
Koichi Shinoda (Tokyo Institute of Technology)

Most compensation methods for robust speech recognition against noise assume independence between speech, additive noise, and convolutive noise. However, the nonlinear nature of the distortion caused by noise may introduce correlation between noise and speech. To tackle this issue, we propose generalized-log spectral mean normalization (GLSMN), in which log spectral mean normalization (LSMN) is carried out in the q-logarithmic domain. Experiments on the Aurora-2 database show that GLSMN improved speech recognition accuracies by 20% compared to cepstral mean normalization (CMN) in the mel-frequency domain.
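As a rough illustration of the idea, the q-logarithm generalizes the log compression applied before mean normalization. The sketch below shows how LSMN becomes GLSMN; the choice q = 0.9 and simple per-utterance mean subtraction are illustrative assumptions, not the authors' exact configuration:

```python
import numpy as np

def q_log(x, q=0.9):
    """Generalized (q-)logarithm; reduces to the natural log as q -> 1."""
    if q == 1.0:
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def glsmn(power_spectra, q=0.9):
    """Generalized-log spectral mean normalization: compress the spectra
    with the q-log, then subtract the per-utterance mean of each bin."""
    g = q_log(power_spectra, q)              # shape: (frames, bins)
    return g - g.mean(axis=0, keepdims=True)
```

With q = 1 this reduces to conventional LSMN; the paper's contribution lies in choosing the q-log domain so that the speech-noise correlation left by the nonlinear distortion is better compensated.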

#3 Zero-Crossing-Based Channel Attentive Weighting of Cepstral Features for Robust Speech Recognition: The ETRI 2011 CHiME Challenge System

Young-Ik Kim (ETRI)
Hoon-Young Cho (ETRI)
Sang-Hoon Kim (ETRI)

We present a practical, noise-robust speech recognition system that estimates a target-to-interferers power ratio using a zero-crossing-based binaural model and applies the power ratio to a channel-attentive missing-feature decoder in the cepstral domain. In a natural multisource environment, our binaural model extracts spatial cues at each zero-crossing of a filterbank output signal to localize multiple sound sources, and it reliably estimates a ratio mask that segregates target speech from interfering noises. Our system uses gammatone filterbank cepstral coefficients (GFCCs) for recognition, and the channel-attentive decoder uses the ratio mask to weight the cepstral features when calculating the output probability during Viterbi decoding. In experiments on the final CHiME test set, our channel-attentive GFCC system improves on the baseline recognition result by 12.2% on average, and with a noisy training condition, the average improvement amounts to 18.8%.

#4 Feature Compensation for Speech Recognition in Severely Adverse Environments due to Background Noise and Channel Distortion

Wooil Kim (University of Texas at Dallas)
John H. L. Hansen (University of Texas at Dallas)

This paper proposes an effective feature compensation scheme to address severely adverse environments for robust speech recognition, where background noise and channel distortion are simultaneously involved. An iterative channel estimation method is integrated into the framework of our Parallel Combined Gaussian Mixture Model based feature compensation algorithm. A new speech corpus is generated which reflects both additive and convolutional noise corruption. Performance evaluation demonstrates that the proposed feature compensation scheme is significantly effective in improving speech recognition performance in the presence of both background noise and channel distortion, compared to conventional methods including the ETSI AFE.

#5 Binaural cues for fragment-based speech recognition in reverberant multisource environments

Ning Ma (Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK)
Jon Barker (Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK)
Heidi Christensen (Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK)
Phil Green (Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK)

This paper addresses the problem of speech recognition using distant binaural microphones in reverberant multisource noise conditions. Our scheme employs a two-stage fragment decoding approach: first, spectro-temporal acoustic source fragments are identified using signal-level cues; second, a hypothesis-driven stage simultaneously searches for the most probable speech/background fragment labelling and the corresponding acoustic model state sequence. The paper reports the first successful attempt to use binaural localisation cues within this framework. By integrating binaural cues and acoustic models in a consistent probabilistic framework, the decoder is able to derive significant recognition performance benefits from fragment location estimates despite their inherent unreliability.

#6 Sub-band level Histogram Equalization for Robust Speech Recognition

Vikas Joshi (Indian Institute of Technology Madras (IIT-M))
Raghvendra Biligi (Indian Institute of Technology Madras (IIT-M))
Umesh S (Indian Institute of Technology Madras (IIT-M))
Luz Garcia (University of Granada, Spain)
Carmen Benitez (University of Granada, Spain)

This paper describes a novel modification of the Histogram Equalization (HEQ) approach to robust speech recognition. We propose separate equalization of the high-frequency (HF) and low-frequency (LF) bands. We study different combinations of sub-band equalization and obtain the best results when we perform a two-stage equalization. First, conventional HEQ is performed on the cepstral features; this does not completely equalize the HF and LF bands, even though the overall histogram equalization is good. In the second stage, equalization is done separately on the HF and LF components of the equalized cepstra. We refer to this approach as Sub-band Histogram Equalization (S-HEQ). The new set of features has better equalization of the sub-bands as well as of the overall cepstral histogram. Recognition results show relative improvements in WER of 12% and 15% over conventional HEQ on the Aurora-2 and Aurora-4 databases, respectively.
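The first stage, conventional HEQ, maps each feature dimension to a reference distribution through its empirical ranks. A minimal sketch of the two-stage idea follows, assuming a standard-normal reference and, as a stand-in for the paper's LF/HF decomposition, a simple split by cepstral order (`split=6` is an arbitrary choice, not the authors' decomposition):

```python
import numpy as np
from statistics import NormalDist

def heq(feats):
    """Histogram equalization: map each feature dimension to a standard
    normal reference distribution via mid-rank empirical quantiles."""
    T = feats.shape[0]
    nd = NormalDist()
    out = np.empty(feats.shape, dtype=float)
    for d in range(feats.shape[1]):
        ranks = feats[:, d].argsort().argsort()        # rank of each frame, 0..T-1
        out[:, d] = [nd.inv_cdf((r + 0.5) / T) for r in ranks]
    return out

def s_heq(ceps, split=6):
    """Two-stage S-HEQ sketch: full HEQ first, then separate HEQ of two
    coefficient groups (here: low- vs. high-order cepstral coefficients)."""
    eq = heq(ceps)
    return np.hstack([heq(eq[:, :split]), heq(eq[:, split:])])
```

Because HEQ is a monotone per-dimension mapping, applying it again to each sub-group only changes the features where the sub-group statistics differ from the full-band ones, which is exactly the residual mismatch the second stage targets.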

#7 GMM-based missing-feature reconstruction on multi-frame windows

Ulpu Remes (Aalto University School of Science)
Yoshihiko Nankaku (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)

Methods for missing-feature reconstruction substitute noise-corrupted features with clean-speech estimates calculated from reliable information found in the noisy speech signal. Gaussian mixture model (GMM) based reconstruction has conventionally focussed on reliable information present in a single frame. In this work, GMM-based reconstruction is applied to windows that span several time frames. Mixtures of factor analysers (MFA) are used to limit the number of model parameters needed to describe the feature distribution as the window width increases. Using the window-based MFA in a noisy speech recognition task resulted in relative error reductions of up to 52% compared to the frame-based GMM.
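At the heart of such reconstruction is conditional Gaussian imputation: unreliable components are replaced by their expected value given the reliable ones. A single-Gaussian sketch is given below; the GMM/MFA methods described in the abstract mix this estimate over components, and the window-based variant stacks several frames into one vector:

```python
import numpy as np

def conditional_impute(x, reliable, mean, cov):
    """Replace unreliable components of x by the conditional mean of a
    Gaussian N(mean, cov) given the reliable components."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    r = np.asarray(reliable, dtype=bool)
    m = ~r
    x_hat = x.copy()
    if not m.any():                      # nothing to reconstruct
        return x_hat
    if not r.any():                      # no reliable evidence: fall back to the prior mean
        return mean.copy()
    S_rr = cov[np.ix_(r, r)]             # reliable-reliable covariance
    S_mr = cov[np.ix_(m, r)]             # missing-reliable cross-covariance
    x_hat[m] = mean[m] + S_mr @ np.linalg.solve(S_rr, x[r] - mean[r])
    return x_hat
```

Widening the window grows the covariance quadratically in the number of stacked frames, which is why the paper constrains it with a factor-analysis (MFA) parameterization.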

#8 Improvements of a dual-input DBN for noise robust ASR

Yang Sun (Centre for Language and Speech Technology, Radboud University Nijmegen)
Jort F. Gemmeke (Centre for Language and Speech Technology, Radboud University Nijmegen)
Bert Cranen (Centre for Language and Speech Technology, Radboud University Nijmegen)
Louis ten Bosch (Centre for Language and Speech Technology, Radboud University Nijmegen)
Lou Boves (Centre for Language and Speech Technology, Radboud University Nijmegen)

In previous work we have shown that an ASR system consisting of a dual-input Dynamic Bayesian Network (DBN), which simultaneously observes MFCC acoustic features and an exemplar-based Sparse Classification (SC) phoneme predictor stream, can achieve better word recognition accuracies in noise than a system that observes only one input stream. This paper explores three modifications of the SC input to further improve the noise robustness of the dual-input DBN system: 1) using state likelihoods instead of phoneme likelihoods, 2) integrating more contextual information, and 3) using the complete likelihood distribution. Experiments on Aurora 2 reveal that the combination of the first two approaches significantly improves the recognition results, achieving up to 29% (absolute) accuracy gain at SNR -5 dB. In the dual-input system, using the full likelihood vector does not outperform using the best state prediction.

#9 Denoising Using Optimized Wavelet Filtering for Automatic Speech Recognition

Randy Gomez (Kyoto University)
Tatsuya Kawahara (Kyoto University)

We present an improved denoising method based on filtering of the noisy wavelet coefficients using a Wiener gain for automatic speech recognition (ASR). We optimize the wavelet parameters for speech and different noise profiles to achieve a better estimate of the Wiener gain for effective filtering. Moreover, we introduce a scaling parameter to the Wiener gain, to minimize mismatch caused by distortion during the denoising process. Experimental results in large vocabulary continuous speech recognition (LVCSR) show that the proposed method is effective and robust to different noise conditions.
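A bare-bones form of per-coefficient Wiener filtering with a gain-scaling parameter might look as follows; the wavelet transform itself, the optimized wavelet parameters, and the exact placement of the scaling factor `alpha` in the authors' method are assumptions here, not reproduced from the paper:

```python
import numpy as np

def wiener_denoise(coeffs, noise_var, alpha=1.0):
    """Apply a Wiener gain to (wavelet) coefficients of noisy speech.
    Signal power is crudely estimated by subtracting the noise variance;
    `alpha` scales the gain to trade residual noise against distortion."""
    sig_var = np.maximum(coeffs ** 2 - noise_var, 0.0)   # floor at zero
    gain = sig_var / (sig_var + noise_var + 1e-12)       # Wiener gain in [0, 1)
    return alpha * gain * coeffs
```

Coefficients well above the noise floor pass nearly unchanged, while coefficients at or below it are suppressed; the paper's contribution is optimizing the wavelet parameters per noise profile so that this gain estimate is accurate.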

#10 Noise Robust Speaker-Independent Speech Recognition with Invariant-Integration Features Using Power-Bias Subtraction

Florian Müller (Institute for Signal Processing, University of Lübeck, Germany)
Alfred Mertins (Institute for Signal Processing, University of Lübeck, Germany)

This paper presents new results on the robustness of invariant-integration features (IIFs) in noisy conditions. Furthermore, it is shown that a feature-enhancement method for noisy conditions known as "power-bias subtraction" can be combined with the IIF approach to improve its performance in noisy environments while keeping the robustness of the IIFs to mismatched vocal-tract lengths between training and testing conditions. Results are presented for experiments with training on clean speech only as well as for experiments with matched-condition training.