12th Annual Conference of the International Speech Communication Association


Interspeech 2011 Florence

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself. If you have signed in to My Schedule, you can add papers to your own personalised list.

Mon-Ses1-P4: ASR - Acoustic Models I

Time: Monday 10:00   Place: Faenza 2 - Pala Congressi (Passi Perduti-Gallery)   Type: Poster
Chair: Lori Lamel

#1 Semi-automatic acoustic model generation from large unsynchronized audio and text chunks

Michele Alessandrini (Università Politecnica delle Marche)
Giorgio Biagetti (Università Politecnica delle Marche)
Alessandro Curzi (Università Politecnica delle Marche)
Claudio Turchetti (Università Politecnica delle Marche)

In this paper an effective technique to train an acoustic model from large and unsynchronized audio and text chunks is presented. Given such a speech corpus, an algorithm to automatically segment each chunk into smaller fragments and to synchronize those to the corresponding text is defined. These smaller fragments are more suitable for use in standard model training algorithms for automatic speech recognition systems. The proposed approach is particularly suitable for bootstrapping language models without relying on specialized training material or borrowing from models trained for other, similar languages. Extensive experimentation using the CMU Sphinx 4 recognizer and the SphinxTrain model generator in a setting designed for large-vocabulary continuous speech recognition shows the effectiveness of the approach.
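To illustrate the synchronization idea (a minimal sketch, not the authors' implementation; the decoder output, word timings, and anchor threshold are hypothetical), one can decode a long chunk with a seed recognizer, align the hypothesis to the reference text, and cut the chunk at stretches of agreeing words:

    from difflib import SequenceMatcher

    reference = "the cat sat on the mat and then it slept".split()
    # (word, start_s, end_s) triples as a seed recognizer might emit them
    hypothesis = [("the", 0.0, 0.2), ("cat", 0.2, 0.5), ("sat", 0.5, 0.8),
                  ("under", 0.8, 1.1), ("the", 1.1, 1.2), ("mat", 1.2, 1.5),
                  ("and", 1.5, 1.7), ("then", 1.7, 1.9), ("it", 1.9, 2.0),
                  ("slept", 2.0, 2.4)]

    matcher = SequenceMatcher(a=reference, b=[w for w, _, _ in hypothesis])
    anchors = [m for m in matcher.get_matching_blocks() if m.size >= 2]

    # Each anchor is a run of agreeing words whose text and audio span are now
    # known, yielding fragments usable by standard alignment-based training.
    for m in anchors:
        words = reference[m.a:m.a + m.size]
        start, end = hypothesis[m.b][1], hypothesis[m.b + m.size - 1][2]
        print(f"fragment {start:.1f}-{end:.1f}s: {' '.join(words)}")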

#2 Unsupervised Testing Strategies for ASR

Brian Strope (Google)
Doug Beeferman (Google)
Alexander Gruenstein (Google)
Xin Lei (Google)

This paper describes unsupervised strategies for estimating relative accuracy differences between acoustic models or language models used for automatic speech recognition. To test acoustic models, the approach extends ideas used for unsupervised discriminative training to include a more explicit validation on held out data. To test language models, we use a dual interpretation of the same process, this time allowing us to measure differences by exploiting expected 'truth gradients' between strong and weak acoustic models. The paper shows correlations between supervised and unsupervised measures across a range of acoustic model and language model variations. We also use unsupervised tests to assess the non-stationary nature of mobile speech input.
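A loose illustration of the 'truth gradient' idea (a sketch on invented placeholder hypotheses, not the authors' pipeline): a stronger system's output serves as pseudo-truth, and candidate models are ranked by how often they agree with it:

    def agreement(hyps_a, hyps_b):
        """Fraction of utterances on which two systems produce identical output."""
        return sum(a == b for a, b in zip(hyps_a, hyps_b)) / len(hyps_a)

    strong_system = ["call mom now", "set an alarm", "play jazz music"]  # pseudo-truth
    candidate_a   = ["call mom now", "set an alarm", "play jas music"]
    candidate_b   = ["call mom now", "set a alarm",  "play jas music"]

    # The paper's point is that such unsupervised comparisons correlate with
    # supervised accuracy differences; here candidate_a ranks above candidate_b.
    print(agreement(strong_system, candidate_a), agreement(strong_system, candidate_b))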

#3 Acoustic Model Training with Detecting Transcription Errors in the Training Data

Gakuto KURATA (IBM Research - Tokyo)
Nobuyasu ITOH (IBM Research - Tokyo)
Masafumi NISHIMURA (IBM Research - Tokyo)

As the target of ASR has moved from clean read speech to spontaneous conversational speech, we need to prepare orthographic transcripts of spontaneous conversational speech to train acoustic models (AMs). However, it is expensive and slow to manually transcribe such speech word by word. We propose a framework to train an AM based on easy-to-make rough transcripts in which fillers and word fragments are not precisely transcribed and some transcription errors are included. By focusing on the phone duration in the result of forced alignment between the rough transcripts and the utterances, we can detect the erroneous parts in the rough transcripts. A preliminary experiment showed that we can detect the erroneous parts with moderately high recall and precision. Through ASR experiments with conversational telephone speech, we confirmed that automatic detection helped improve the performance of the AM trained with both conventional ML and state-of-the-art boosted MMI criteria.
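A minimal sketch of the duration cue described above (the alignment data and thresholds are hypothetical): after forced alignment of the rough transcript, phones whose aligned durations are implausible flag likely transcription errors, and the surrounding regions can be excluded or down-weighted during AM training:

    # (phone, duration in frames) from a forced alignment; placeholder values
    alignment = [("sil", 30), ("k", 8), ("ae", 12), ("t", 2), ("s", 1), ("ah", 45)]
    MIN_FRAMES, MAX_FRAMES = 3, 40   # assumed plausible duration range

    suspect = [(phone, dur) for phone, dur in alignment
               if phone != "sil" and not (MIN_FRAMES <= dur <= MAX_FRAMES)]
    print(suspect)   # phones with implausible durations mark likely erroneous regions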

#4 Towards Unsupervised Training of Speaker Independent Acoustic Models

Aren Jansen (Johns Hopkins University)
Kenneth Church (Johns Hopkins University)

Can we automatically discover speaker independent phoneme-like subword units with zero resources in a surprise language? There have been a number of recent efforts to automatically discover repeated spoken terms without a recognizer. This paper investigates the feasibility of using these results as constraints for unsupervised acoustic model training. We start with a relatively small set of word types, as well as their locations in the speech. The training process assumes that repetitions of the same (unknown) word share the same (unknown) sequence of subword units. For each word type, we train a whole-word hidden Markov model with Gaussian mixture observation densities and collapse correlated states across the word types using spectral clustering. We find that the resulting state clusters align reasonably well along phonetic lines. In evaluating cross-speaker word similarity, the proposed techniques outperform both raw acoustic features and language-mismatched acoustic models.
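A minimal sketch of the state-collapsing step (a toy random similarity matrix stands in for the measured correlations between whole-word HMM states; this is not the authors' code):

    import numpy as np
    from sklearn.cluster import SpectralClustering

    rng = np.random.default_rng(0)
    n_states = 12                      # states pooled from several whole-word HMMs
    sim = rng.random((n_states, n_states))
    sim = (sim + sim.T) / 2            # symmetric, nonnegative affinity matrix
    np.fill_diagonal(sim, 1.0)

    # Spectral clustering groups correlated states into shared subword-like units.
    labels = SpectralClustering(n_clusters=4, affinity="precomputed",
                                random_state=0).fit_predict(sim)
    print(labels)                      # cluster id = pseudo subword unit per state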

#5 Acoustic Modeling with Bootstrap and Restructuring Based on Full Covariance

Xiaodong Cui (IBM T. J. Watson Research Center)
Xin Chen (University of Missouri, Columbia)
Jian Xue (IBM T. J. Watson Research Center)
Peder A. Olsen (IBM T. J. Watson Research Center)
John R. Hershey (Mitsubishi Electric Research Laboratories)
Bowen Zhou (IBM T. J. Watson Research Center)

Bootstrap and restructuring (BSRS) has been shown in our previous work to be superior to the conventional acoustic modeling approach when dealing with low-resourced languages. This paper presents a full-covariance-based BSRS scheme, which is an extension of our previous work on diagonal-covariance-based BSRS acoustic modeling. Since full covariance provides richer structural information about the acoustic model than its diagonal counterpart, it is advantageous for both model clustering and refinement. Therefore, in this work, full covariance is employed in BSRS to keep the structural information until the last step, before being converted to diagonal covariance for practical applications. We show that using full covariance further improves performance over diagonal covariance in the BSRS acoustic modeling framework under the same model size, without increasing computational cost in decoding.

#6 An i-Vector based Approach to Acoustic Sniffing for Irrelevant Variability Normalization based Acoustic Model Training and Speech Recognition

Jian Xu (University of Science and Technology of China)
Yu Zhang (Shanghai Jiao Tong University)
Zhi-Jie Yan (Microsoft Research Asia)
Qiang Huo (Microsoft Research Asia)

This paper presents a new approach to acoustic sniffing for irrelevant variability normalization (IVN) based acoustic model training and speech recognition. Given a training corpus, a so-called i-vector is extracted from each training speech segment. A clustering algorithm is used to cluster the training i-vectors into multiple clusters, each corresponding to an acoustic condition. Acoustic sniffing can then be implemented as finding the most similar cluster by comparing the i-vector extracted from a speech segment with the centroid of each cluster. Experimental results on the Switchboard-1 conversational telephone speech transcription task suggest that the i-vector based acoustic sniffing outperforms our previous Gaussian mixture model (GMM) based approach. The proposed approach is very efficient and can therefore deal with very large-scale training corpora on current mainstream computing platforms, yet has very low run-time cost.
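A minimal sketch of the sniffing step (random vectors stand in for real i-vectors; the cluster count and the choice of k-means are assumptions made for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    train_ivectors = rng.normal(size=(500, 100))   # one i-vector per training segment

    # Cluster the training i-vectors; each cluster represents an acoustic condition.
    kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(train_ivectors)

    def sniff(segment_ivector):
        """Return the acoustic-condition (cluster) index for a new speech segment."""
        return int(kmeans.predict(segment_ivector.reshape(1, -1))[0])

    print(sniff(rng.normal(size=100)))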

#7 Log-linear Optimization of Second-order Polynomial Features with Subsequent Dimension Reduction for Speech Recognition

Muhammad Ali Tahir (RWTH Aachen University, Aachen, Germany)
Ralf Schlueter (RWTH Aachen University, Aachen, Germany)
Hermann Ney (RWTH Aachen University, Aachen, Germany)

Second-order polynomial features are useful for speech recognition because they can be used to model class-specific covariance even with a pooled-covariance acoustic model. Previous experiments with second-order features have shown word error rate improvements. However, the improvement comes at the price of a large increase in the number of parameters. This paper investigates the discriminative training of second-order features, with a subsequent dimension reduction transform to limit the increase in the number of parameters. The acoustic model parameters and the transformation matrix parameters are modeled log-linearly and optimized using the maximum mutual information criterion. The advantage of log-linear optimization lies in its ability to robustly combine different kinds of features. Experiments are performed for second-order MFCC features on the EPPS large-vocabulary task and have resulted in a decrease in word error rate.
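A minimal sketch of the feature construction (the projection below is random, purely for illustration; in the paper it is trained log-linearly with the MMI criterion together with the acoustic model):

    import numpy as np
    from itertools import combinations_with_replacement

    def second_order(x):
        """Concatenate x with all pairwise products x_i * x_j (i <= j)."""
        pairs = [x[i] * x[j] for i, j in combinations_with_replacement(range(len(x)), 2)]
        return np.concatenate([x, np.array(pairs)])

    rng = np.random.default_rng(0)
    mfcc = rng.normal(size=16)                         # one MFCC frame (placeholder)
    expanded = second_order(mfcc)                      # 16 + 16*17/2 = 152 dimensions
    projection = rng.normal(size=(40, expanded.size))  # stand-in for the trained matrix
    reduced = projection @ expanded                    # dimension-reduced feature
    print(expanded.size, reduced.size)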

#8 Genre Categorization and Modeling for Broadcast Speech Transcription

Qingqing Zhang (Spoken Language Processing Group, LIMSI-CNRS)
Lori Lamel (Spoken Language Processing Group, LIMSI-CNRS)
Jean-Luc Gauvain (Spoken Language Processing Group, LIMSI-CNRS)

Broadcast News (BN) transcription has attracted research since the mid-1990s due to the challenges of the task. More recently, research has been moving towards more spontaneous broadcast data, commonly called Broadcast Conversation (BC) speech. Considering the large stylistic difference between the BN and BC genres, genre-specific modeling should intuitively result in improved system performance. In this paper BN- and BC-style speech recognition is explored by designing genre-specific systems. In order to separate the training data, automatic genre categorization with two novel features is proposed. Experiments showed that automatic categorization of the genre labels of the training data compared favorably to the original manually specified genre labels provided with the corpora. When test data sets were classified into the BN or BC genre and tested with the corresponding genre-specific speech recognition systems, modest but consistent error reductions were achieved compared to the baseline genre-independent systems.

#9 Individual Error Minimization Learning Framework and its Applications to Speech Recognition and Utterance Verification

Sunghwan Shin (Center for Signal and Image Processing, Georgia Institute of Technology, Atlanta, GA, USA)
Ho-Young Jung (Speech Language Processing Team, Electronics and Telecommunications Research Institute, Daejeon, South Korea)
Biing-Hwang Juang (Center for Signal and Image Processing, Georgia Institute of Technology, Atlanta, GA, USA)

In this paper, we extend the individual recognition error minimization criteria MDE/MIE/MSE [1] to the word level and apply them to word recognition and verification tasks, respectively. In order to effectively reduce potential errors at the word level, we expand the training token selection scheme to be more appropriate for a word-level learning framework, by taking into account neighboring words and by covering internal phonemes in each training word. We then examine the proposed word-level learning criteria on the TIMIT word recognition task and further investigate the individual rejection performance for recognition errors in utterance verification (UV). Experimental results confirm that each of the word-level objective criteria primarily reduces the corresponding target error type. The rejection rates of insertion and substitution errors are also improved under the MIE and MSE criteria, which leads to an additional word error rate reduction after rejection.

#10 Effective Triphone Mapping for Acoustic Modeling in Speech Recognition

Sakhia Darjaa (Slovak Academy of Sciences)
Miloš Cerňak (Slovak Academy of Sciences)
Marián Trnka (Slovak Academy of Sciences)
Milan Rusko (Slovak Academy of Sciences)
Róbert Sabo (Slovak Academy of Sciences)

This paper presents an effective triphone mapping for acoustic model training in automatic speech recognition, which allows the synthesis of unseen triphones. A description of this data-driven model clustering, including experiments performed using 350 hours of a Slovak audio database of mixed read and spontaneous speech, is presented. The proposed technique is compared with tree-based state tying, and it is shown that for bigger acoustic models, at a size of 4000 states and more, a triphone-mapped HMM system achieves better performance than a tree-based state-tying system. The main gain in performance is due to the late application of triphone mapping to monophones with multiple Gaussian pdfs, so the cloned triphones are initialized better than with single-Gaussian monophones. The absolute decrease in word error rate was 0.46% (5.73% relative) for models with 7500 states, decreasing to a 0.4% (5.17% relative) gain at 11500 states.
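As a rough illustration of mapping unseen triphones onto seen ones (the inventory and the back-off preference below are invented for the example and are not the paper's data-driven mapping):

    seen = {"a-b+c", "a-b+d", "x-b+c", "k-a+t"}   # triphones observed in training

    def map_triphone(left, centre, right, seen):
        exact = f"{left}-{centre}+{right}"
        if exact in seen:
            return exact
        # prefer a seen triphone sharing the centre phone and right context,
        # then one sharing the centre phone and left context, else the monophone
        same_right = sorted(t for t in seen if t.split("-")[1] == f"{centre}+{right}")
        if same_right:
            return same_right[0]
        same_left = sorted(t for t in seen if t.startswith(f"{left}-{centre}+"))
        if same_left:
            return same_left[0]
        return centre

    print(map_triphone("z", "b", "c", seen))   # unseen 'z-b+c' maps to a seen 'b'-centred triphone
    print(map_triphone("z", "q", "c", seen))   # nothing shares centre 'q': fall back to monophone 'q'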

#11 Analysis of Dialectal Influence in Pan-Arabic ASR

Udhyakumar Nallasamy (Language Technologies Institute, CMU)
Michael Garbus (Language Technologies Institute, CMU)
Florian Metze (Language Technologies Institute, CMU)
Qin Jin (Language Technologies Institute, CMU)
Thomas Schaaf (Multimodal Technologies, Inc.)
Tanja Schultz (Language Technologies Institute, CMU)

In this paper, we present various experiments analyzing the influence of five dialects of the Arabic language on an Automatic Speech Recognition (ASR) system. We discuss our efforts in building the baseline ASR system and present a detailed analysis of the impact of dialects on different ASR components, including the front-end and the pronunciation dictionary. We use the ASR phonetic decision tree as a diagnostic tool to evaluate the robustness of different front-ends to dialectal variations in the speech data. We also perform a rule-based analysis of the pronunciation dictionary, which enables us to identify dialectal words in the vocabulary and automatically generate pronunciations for unseen words.

#12 Connected Digit Recognition by Means of Reservoir Computing

Azarakhsh Jalalvand (ELIS-UGent)
Fabian Triefenbach (ELIS-UGent)
David Verstraeten (ELIS-UGent)
Jean-Pierre Martens (ELIS-UGent)

Most automatic speech recognition systems employ Hidden Markov Models with Gaussian mixture emission distributions to model the acoustics. There have been several attempts, however, to challenge this approach, e.g. by introducing a neural network (NN) as an alternative acoustic model. Although the performance of these so-called hybrid systems is actually quite good, their training is often problematic and time consuming. By using a reservoir (a recurrent NN in which only the output weights are trainable) we can overcome this disadvantage and still obtain good accuracy. In this paper, we propose the first reservoir-based connected digit recognition system, and we demonstrate good performance on the Aurora-2 testbed. Since reservoir computing (RC) is a new technology, we anticipate that our present system is still sub-optimal, and further improvements are possible.
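A minimal sketch of a reservoir (the sizes, spectral-radius scaling, and toy frame-level targets are illustrative assumptions, not the paper's configuration): the recurrent weights stay fixed and only a linear readout is trained, here by ridge regression:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_res = 39, 200                        # e.g. one acoustic frame, reservoir size
    W_in = rng.normal(scale=0.1, size=(n_res, n_in))
    W = rng.normal(size=(n_res, n_res))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # keep spectral radius below 1

    def run_reservoir(frames):
        """Return the reservoir state after each input frame."""
        x = np.zeros(n_res)
        states = []
        for u in frames:
            x = np.tanh(W_in @ u + W @ x)
            states.append(x)
        return np.array(states)

    frames = rng.normal(size=(100, n_in))        # placeholder acoustic frames
    targets = rng.integers(0, 11, size=100)      # placeholder frame-level digit labels
    S = run_reservoir(frames)
    Y = np.eye(11)[targets]                      # one-hot targets
    # Ridge-regression readout: the only trained parameters in the whole system.
    W_out = np.linalg.solve(S.T @ S + 1e-2 * np.eye(n_res), S.T @ Y)
    print((S @ W_out).argmax(axis=1)[:10])       # readout's frame-level decisions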

#13 Large Margin - Minimum Classification Error Using Sum of Shifted Sigmoids as the Loss Function

Madhavi Ratnagiri (Department of Electrical and Computer Engineering, Rutgers University, Piscataway, New Jersey)
Biing-Hwang Juang (School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia)
Lawrence Rabiner (Department of Electrical and Computer Engineering, Rutgers University, Piscataway, New Jersey)

We have developed a novel loss function that embeds large-margin classification into Minimum Classification Error (MCE) training. Unlike previous efforts, this approach employs a loss function that is bounded and does not require incremental adjustment of the margin or prior MCE training. It extends the Bayes risk formulation of MCE using Parzen window estimation to incorporate large-margin classification and develops a loss function that is a sum of shifted sigmoids. Experimental results show an improvement in recognition performance when evaluated on the TIDigits database.
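A minimal sketch of such a loss (the shifts and slope below are illustrative values, not those derived in the paper): summing sigmoids of the MCE misclassification measure at several offsets keeps the loss bounded while spreading the penalty over a margin region:

    import numpy as np

    def shifted_sigmoid_loss(d, shifts=(0.0, 0.5, 1.0), alpha=4.0):
        """Bounded loss: a sum of sigmoids of the misclassification measure d."""
        return sum(1.0 / (1.0 + np.exp(-alpha * (d + s))) for s in shifts)

    d = np.linspace(-2, 2, 5)        # d > 0 indicates misclassification in MCE
    print(shifted_sigmoid_loss(d))   # increases with d, saturates at len(shifts)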

#14 Representing Phonological Features through a Two-Level Finite State Model

Javier Mikel Olaso (Universidad del Pais Vasco)
María Inés Torres (Universidad del Pais Vasco)
Raquel Justo (Universidad del Pais Vasco)

Articulatory information has been shown to be useful for improving phone recognition performance in ASR systems, with Dynamic Neural Networks being the most successful method for detecting articulatory gestures from the speech signal. On the other hand, Stochastic Finite State Automata (SFSA) have been effectively used in many speech-input natural language tasks. In this work, SFSA are used to represent phonological features. A hierarchical model able to consider sequences of acoustic observations along with sequences of phonological features is defined. From this formulation a classifier of articulatory features has been derived and then evaluated on a Spanish phonetic corpus. Experimental results show that this is a promising framework for detecting and including phonological knowledge in ASR systems. Keywords: phonological features, ASR, finite state models, stochastic finite state automata, k-tss models

#15 Optimization of the Gaussian Mixture Model Evaluation on GPU

Jan Vanek (University of West Bohemia)
Jan Trmal (University of West Bohemia)
Josef V. Psutka (University of West Bohemia)
Josef Psutka (University of West Bohemia)

In this paper we present a highly optimized implementation of the Gaussian mixture acoustic model evaluation algorithm. Evaluation of these likelihoods is one of the most computationally intensive parts of automatic speech recognizers, but it can be well parallelized and offloaded to GPU devices. Our approach offers a significant speed-up compared to recently published approaches, since it exploits the GPU architecture better. All the recent implementations targeted only NVIDIA graphics processors, programmed either in the CUDA or OpenCL GPU programming framework. We present results for both CUDA and OpenCL. The results suggest that even very large acoustic models can be used in real-time speech recognition engines on computers and laptops equipped with a low-end GPU. Optimizing the acoustic likelihood computation on the GPU frees the remaining GPU resources for offloading other compute-intensive parts of the LVCSR decoder.
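A minimal sketch of the bulk computation being offloaded (written with NumPy for clarity; the model and batch sizes are placeholders): evaluating diagonal-covariance GMM log-likelihoods for a whole batch of frames is dense, regular arithmetic of the kind that maps well onto a GPU in CUDA or OpenCL:

    import numpy as np

    rng = np.random.default_rng(0)
    frames = rng.normal(size=(256, 39))           # batch of feature frames
    means = rng.normal(size=(128, 39))            # one GMM state: 128 diagonal Gaussians
    variances = rng.uniform(0.5, 2.0, size=(128, 39))
    log_weights = np.log(np.full(128, 1.0 / 128))

    # log N(x | mu, diag(var)) for every (frame, component) pair, fully vectorized
    diff = frames[:, None, :] - means[None, :, :]             # (256, 128, 39)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_lik = np.logaddexp.reduce(log_weights + log_gauss, axis=1)   # per-frame likelihood
    print(log_lik.shape)                          # (256,)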