12th Annual Conference of the International Speech Communication Association


Interspeech 2011 Florence

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left reflects the current presentation order, but this may still change, so please check at the conference itself. If you have signed in to My Schedule, you can add papers to your own personalised list.

Wed-Ses2-O3: Adaptation for ASR

Time: Wednesday 13:30   Place: Brunelleschi (Green Room) - Pala Congressi - 2nd Floor   Type: Oral
Chair: Phil Woodland

13:30 Model Adaptation for Automatic Speech Recognition Based on Multiple Time Scale Evolution

Shinji Watanabe (NTT Corporation)
Atsushi Nakamura (NTT Corporation)
Biing-Hwang Juang (Georgia Institute of Technology)

Changes in speech characteristics originate from various factors and occur at various (temporal) rates in real-world conversation. These temporal changes have their own dynamics; we therefore propose to extend single (time-)incremental adaptation to a multiscale adaptation, which has the potential to greatly increase the model's robustness, since it includes an adaptation mechanism that approximates the nature of the characteristic change. The formulation of incremental adaptation assumes a time-evolution system for the model, in which the posterior distributions used in the decision process are successively updated on a macroscopic time scale in accordance with Kalman filter theory. In this paper, we extend the original incremental adaptation scheme, based on a single time scale, to multiple time scales, and apply the method to the adaptation of both the acoustic model and the language model. We further investigate methods for integrating the multiscale adaptation schemes to realize robust speech recognition performance. Large vocabulary continuous speech recognition experiments on English and Japanese lectures revealed the importance of modeling multiscale properties in speech recognition.
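
The time-evolution view can be pictured with a toy Kalman-filter update. The sketch below is ours, not the authors' code: it tracks a single model parameter with two posterior estimates, one updated at every utterance and one on a slower, macroscopic schedule; the observation stream, noise variances, and scale lengths are all illustrative assumptions.

    import numpy as np

    def kalman_update(mean, var, obs, obs_var, process_var):
        """One incremental update of a scalar model parameter's posterior."""
        var_pred = var + process_var           # the parameter drifts between updates
        gain = var_pred / (var_pred + obs_var)
        mean = mean + gain * (obs - mean)      # move toward the new evidence
        var = (1.0 - gain) * var_pred
        return mean, var

    rng = np.random.default_rng(0)
    obs = 1.0 + 0.3 * rng.standard_normal(200)  # toy per-utterance statistics

    fast = (0.0, 1.0)   # updated every utterance: tracks rapid change
    slow = (0.0, 1.0)   # updated on a macroscopic schedule: smooth long-term trend
    for t, x in enumerate(obs):
        fast = kalman_update(*fast, x, obs_var=0.09, process_var=0.05)
        if t % 20 == 0:
            slow = kalman_update(*slow, x, obs_var=0.09, process_var=0.005)

    print("fast-scale estimate:", fast[0], "slow-scale estimate:", slow[0])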

13:50 Integrated Online Speaker Clustering and Adaptation

Catherine Breslin (Toshiba Research Europe Ltd.)
KK Chin (Toshiba Research Europe Ltd.)
Mark Gales (Toshiba Research Europe Ltd.)
Kate Knill (Toshiba Research Europe Ltd.)

For many applications, it is necessary to produce speech transcriptions in a causal fashion, and to produce high-quality transcripts, speaker adaptation is often used. This requires the development of online speaker clustering and incremental adaptation techniques. This paper presents an integrated approach to online speaker clustering and adaptation which allows efficient clustering of speakers using the same accumulated statistics that are normally used for adaptation. Using a consistent criterion for both clustering and adaptation should yield gains at both stages. The proposed approach is evaluated on a meetings transcription task using audio from multiple distant microphones. Consistent gains over standard clustering and adaptation were obtained.
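
As an illustration of sharing statistics between the two stages, here is a minimal sketch (our assumption of the general idea, not Toshiba's implementation): each cluster accumulates zeroth- and first-order statistics, an incoming utterance joins the cluster whose adapted mean scores it best (or opens a new cluster), and the same accumulators then drive the adaptation.

    import numpy as np

    class Cluster:
        def __init__(self, dim):
            self.n = 0.0                  # zeroth-order (count) statistics
            self.sum = np.zeros(dim)      # first-order (sum) statistics

        def mean_shift(self):
            # the "adaptation" here is a simple bias on the model mean
            return self.sum / self.n if self.n > 0 else np.zeros_like(self.sum)

    def score(cluster, feats):
        # negative squared error of the frames under the cluster's adapted mean
        return -np.sum((feats - cluster.mean_shift()) ** 2)

    def assign(clusters, feats, new_cluster_penalty=50.0):
        # a fresh cluster would adapt straight to this utterance's own mean
        centered = feats - feats.mean(axis=0)
        best, best_s = None, -np.sum(centered ** 2) - new_cluster_penalty
        for c in clusters:
            s = score(c, feats)
            if s > best_s:
                best, best_s = c, s
        if best is None:
            best = Cluster(feats.shape[1])
            clusters.append(best)
        best.n += len(feats)              # the very statistics adaptation reuses
        best.sum += feats.sum(axis=0)
        return best

    rng = np.random.default_rng(1)
    clusters = []
    for spk_mean in (0.0, 3.0, 0.1):      # two nearby "speakers" and one distinct
        utt = spk_mean + rng.standard_normal((100, 4))
        assign(clusters, utt)
    print("clusters formed:", len(clusters))  # expect the 0.0/0.1 pair to merge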

14:10 A study on speaker normalized MLP features in LVCSR

Zoltán Tüske (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University)
Christian Plahl (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University)
Ralf Schlüter (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University)

Different normalization methods are applied in recent Large Vocabulary Continuous Speech Recognition (LVCSR) systems to reduce the influence of speaker variability on the acoustic models. In this paper we investigate the use of Vocal Tract Length Normalization (VTLN) and Speaker Adaptive Training (SAT) in Multi Layer Perceptron (MLP) feature extraction on an English task. We achieve significant improvements with each normalization method and gain further by stacking the normalizations. Studying features transformed by Constrained Maximum Likelihood Linear Regression (CMLLR) based SAT as a possible input for the MLP, further experiments show that the MLP could not consistently take advantage of SAT as it does in the case of VTLN.
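
A toy rendering of the stacked normalizations may help. In the sketch below, "VTLN" is reduced to a grid search over a scalar warp and the CMLLR-based SAT step to a per-speaker affine (mean/variance) transform; the warp grid, reference model, and dimensions are all assumptions, not the paper's setup.

    import numpy as np

    def vtln(feats, warps=np.linspace(0.88, 1.12, 13)):
        # grid-search a scalar warp so the scaled features best match a
        # unit-variance reference (real VTLN warps the frequency axis instead)
        best = max(warps, key=lambda a: -abs(np.var(a * feats) - 1.0))
        return best * feats

    def cmllr_like(feats):
        # per-speaker affine stand-in for CMLLR/SAT: x' = A x + b with diagonal A;
        # true CMLLR estimates A and b by maximum likelihood under the HMM
        mu, sd = feats.mean(axis=0), feats.std(axis=0) + 1e-8
        return (feats - mu) / sd

    rng = np.random.default_rng(2)
    utt = 1.05 * rng.standard_normal((200, 13)) + 0.5   # toy speaker's features
    x = cmllr_like(vtln(utt))       # stacked normalizations feeding the MLP
    print("MLP input shape:", x.shape)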

14:30 Matrix-Variate Distribution of Training Models for Robust Speaker Adaptation

Yongwon Jeong (Pusan National University)
Young Kuk Kim (LG Electronics)

In this paper, we describe a new speaker adaptation method based on a matrix-variate distribution of training models. A set of mean vectors of hidden Markov models (HMMs) is assumed to be drawn from a matrix-variate normal distribution, and bases are derived under this assumption. The resulting bases have the same dimension as eigenvoices, so adaptation can be performed using the same equation. In isolated-word experiments, the proposed method showed performance comparable to the eigenvoice approach in a clean environment, and better performance than the eigenvoice approach under both babble and factory-floor noise. The experimental results demonstrate the validity of the matrix-variate normal assumption about the training models; the proposed method can therefore be used for rapid speaker adaptation in noisy environments.
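
One way such bases can be obtained is sketched below. This is our assumption of a generic recipe, not the paper's exact derivation: treat each training model as a (states x dim) mean matrix under a Kronecker-structured, matrix-variate normal model, estimate row and column covariances by a moment pass, and form bases from outer products of their eigenvectors, which flatten to the same dimension as eigenvoice supervectors.

    import numpy as np

    rng = np.random.default_rng(3)
    S, D, N = 6, 4, 40                        # states, feature dim, training models
    models = rng.standard_normal((N, S, D))   # toy (states x dim) mean matrices
    M0 = models.mean(axis=0)
    R = models - M0

    U = sum(r @ r.T for r in R) / (N * D)     # row (state) covariance estimate
    V = sum(r.T @ r for r in R) / (N * S)     # column (feature) covariance estimate
    _, Eu = np.linalg.eigh(U)
    _, Ev = np.linalg.eigh(V)

    # one basis is the outer product of a row and a column eigenvector; flattened,
    # it has the same dimension as an eigenvoice supervector (S * D)
    basis = np.outer(Eu[:, -1], Ev[:, -1])
    print("basis dimension:", basis.size, "eigenvoice dimension:", S * D)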

14:50 Separating Speaker and Environmental Variability Using Factored Transforms

Michael Seltzer (Microsoft Research)
Alex Acero (Microsoft Research)

Two primary sources of variability that degrade accuracy in speech recognition systems are the speaker and the environment. While many algorithms for speaker or environment adaptation have been proposed to improve performance, far less attention has been paid to approaches that address both factors. In this paper, we present a method for compensating for speaker and environmental mismatch using a cascade of CMLLR transforms. The proposed approach enables speaker transforms estimated in one environment to be effectively applied to speech from the same user in a different environment. It can be further improved using a new training method called speaker and environment adaptive training. When applying speaker transforms to new environments, the proposed approach results in a 13% relative improvement over conventional CMLLR.
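
The factored cascade amounts to composing two affine feature transforms, as in the following sketch (toy transform values, not the paper's estimation procedure): a speaker transform is estimated once and then reused under different environment transforms.

    import numpy as np

    def affine(A, b):
        return lambda x: x @ A.T + b

    def compose(env, spk):
        return lambda x: env(spk(x))    # speaker transform first, then environment

    dim = 3
    speaker = affine(np.eye(dim) * 0.9, np.full(dim, 0.2))    # estimated once
    env1 = affine(np.eye(dim) * 1.1, np.zeros(dim))           # environment 1
    env2 = affine(np.eye(dim) * 0.7, np.full(dim, -0.5))      # environment 2

    x = np.ones((5, dim))
    print(compose(env1, speaker)(x)[0])  # speaker transform used in environment 1
    print(compose(env2, speaker)(x)[0])  # same speaker transform reused in env. 2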

15:10 Your Mobile Virtual Assistant Just Got Smarter!

Mazin Gilbert (AT&T)
Iker Arizmendi (AT&T)
Enrico Bocchieri (AT&T)
Diamantino Caseiro (AT&T)
Vincent Goffin (AT&T)
Andrej Ljolje (AT&T)
Mike Philips (Vlingo)
Chao Wang (Vlingo)
Jay Wilpon (AT&T)

A Mobile Virtual Assistant (MVA) is a communication agent that recognizes and understands free speech, and performs actions such as retrieving information and completing transactions. One essential characteristic of MVAs is their ability to learn and adapt without supervision. This paper describes our ongoing research in developing more intelligent MVAs that recognize and understand very large vocabulary speech input across a variety of tasks. In particular, we present our architecture for unsupervised acoustic and language model adaptation. Experimental results show that unsupervised acoustic model learning approaches the performance of supervised learning when adapting on 40-50 device-specific utterances. Unsupervised language model learning results in an 8% absolute drop in word error rate.
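
The unsupervised adaptation loop the abstract describes can be sketched generically (all components below are stubs and assumptions, not AT&T's system): decode each utterance, gate the hypothesis by a confidence score, and feed confident self-labels to the acoustic and language model updates.

    def adapt_unsupervised(utterances, decode, confidence, update_am, update_lm,
                           threshold=0.8):
        for audio in utterances:
            hyp = decode(audio)                    # 1-best hypothesis
            if confidence(audio, hyp) < threshold:
                continue                           # skip unreliable self-labels
            update_am(audio, hyp)                  # accumulate acoustic statistics
            update_lm(hyp)                         # e.g. count hypothesis n-grams

    counts = {"am": 0, "lm": 0}

    adapt_unsupervised(
        utterances=range(10),                      # stand-ins for audio buffers
        decode=lambda a: f"hyp_{a}",
        confidence=lambda a, h: 0.9 if a % 2 == 0 else 0.5,
        update_am=lambda a, h: counts.update(am=counts["am"] + 1),
        update_lm=lambda h: counts.update(lm=counts["lm"] + 1),
    )
    print(counts)   # only the confident half of the utterances contribute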