12th Annual Conference of the International Speech Communication Association
Interspeech 2011 Florence
Technical Programme
Wed-Ses1-P2: Systems for LVCSR and rich transcription
Time: Wednesday 10:00
Place: Valfonda 2 - Pala Congressi (Passi Perduti-Gallery)
Type: Poster
Chair: Diego Giuliani
#1 | Improving LVCSR System Combination Using Neural Network Language Model Cross Adaptation
Xunying Liu (Cambridge University), Mark Gales (Cambridge University), Phil Woodland (Cambridge University)
State-of-the-art large vocabulary continuous speech recognition (LVCSR) systems often combine outputs from multiple sub-systems developed at different sites. Cross-system adaptation can be used as an alternative to direct hypothesis-level combination schemes such as ROVER. The standard approach cross-adapts only the acoustic models. To fully exploit the complementary features among sub-systems, language model (LM) cross adaptation techniques can be used. In this paper, previous research on multi-level N-gram LM cross adaptation is extended to further include the cross adaptation of neural network LMs. Using this improved LM cross adaptation framework, significant relative error rate gains of 4.0%-7.1% were obtained over acoustic-model-only cross adaptation when combining a range of Chinese LVCSR sub-systems used in the 2010 and 2011 DARPA GALE evaluations.
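As a rough illustration of the LM combination idea behind such cross adaptation, the sketch below linearly interpolates per-word probabilities from an n-gram LM and a neural network LM when rescoring a hypothesis. The toy probability tables and the interpolation weight are illustrative assumptions, not the paper's actual models.

    import math

    # Toy stand-ins for an n-gram LM and a neural network LM: each maps a
    # (history, word) pair to a probability. Real models would be queried here.
    P_NGRAM = {(("the",), "cat"): 0.20, (("the",), "dog"): 0.10}
    P_NNLM  = {(("the",), "cat"): 0.30, (("the",), "dog"): 0.05}

    def interp_logprob(history, word, lam=0.6, floor=1e-10):
        """Log probability of `word` under a linear interpolation of two LMs."""
        p = lam * P_NGRAM.get((history, word), floor) \
            + (1.0 - lam) * P_NNLM.get((history, word), floor)
        return math.log(p)

    def rescore(hypothesis):
        """Score a word sequence with the interpolated LM (bigram history)."""
        total = 0.0
        for i, word in enumerate(hypothesis):
            history = (hypothesis[i - 1],) if i > 0 else ("<s>",)
            total += interp_logprob(history, word)
        return total

    print(rescore(["the", "cat"]))  # scores higher than ["the", "dog"]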
|
#2 | Towards High Performance LVCSR in Speech-to-Speech Translation System on Smart Phones
Jian Xue (IBM T.J. Watson Research Center), Xiaodong Cui (IBM T.J. Watson Research Center), Gregg Daggett (IBM T.J. Watson Research Center), Etienne Marcheret (IBM T.J. Watson Research Center), Bowen Zhou (IBM T.J. Watson Research Center)
This paper presents efforts to improve the performance of large vocabulary continuous speech recognition (LVCSR) in a speech-to-speech translation system on smart phones. A variety of techniques are investigated to achieve high accuracy and low latency given constrained resources: one-pass streaming-mode decoding for minimum latency; acoustic modeling with full covariances, based on bootstrap and model restructuring, to improve recognition accuracy with limited training data; and quantized discriminative feature-space transforms and quantized Gaussian mixture models to reduce memory usage with negligible degradation in recognition accuracy. Speed optimization methods that increase recognition speed are also discussed. The proposed techniques, evaluated on the DARPA Transtac datasets, are shown to give good overall performance under the CPU and memory constraints of smart phones.
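To make the memory-reduction idea concrete, here is a minimal sketch of linear 8-bit quantization of model parameters (e.g., Gaussian mean vectors) with a per-array scale and offset. The array contents and bit width are illustrative assumptions; the paper's actual quantization scheme may differ.

    import numpy as np

    def quantize(params, bits=8):
        """Map float parameters to unsigned integers with a scale and offset."""
        lo, hi = float(params.min()), float(params.max())
        scale = (hi - lo) / (2 ** bits - 1) or 1.0  # guard against flat arrays
        q = np.round((params - lo) / scale).astype(np.uint8)
        return q, lo, scale

    def dequantize(q, lo, scale):
        """Recover approximate float parameters at lookup time."""
        return q.astype(np.float32) * scale + lo

    means = np.random.randn(4, 13).astype(np.float32)  # toy GMM mean vectors
    q, lo, scale = quantize(means)
    # Small reconstruction error at one quarter of the float32 memory.
    print(np.max(np.abs(means - dequantize(q, lo, scale))))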
|
#3 | Deploying Google Search by Voice in Cantonese
Yun-Hsuan Sung (Google Inc.), Martin Jansche (Google Inc.), Pedro Moreno (Google Inc.)
We describe our efforts in deploying Google search by voice for Cantonese, a southern Chinese dialect widely spoken in and around Hong Kong and Guangzhou. We collected audio data from local Cantonese speakers in Hong Kong and Guangzhou using our DataHound smartphone application. This data was used to create appropriate acoustic models. Language models were trained on anonymized query logs from Google Web Search for Hong Kong. Because users in Hong Kong frequently mix English and Cantonese in their queries, we designed our system from the ground up to handle both languages. We report on experiments with different techniques for mapping the phoneme inventories of both languages into a common space. Based on extensive experiments, we report word error rates and web scores for both Hong Kong and Guangzhou data. Cantonese Google search by voice was launched in December 2010.
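One simple way to map two phoneme inventories into a common space, as discussed above, can be sketched as follows: a hand-built table sends language-specific phones to shared units, and unmapped phones keep a language prefix. The table entries are hypothetical examples, not Google's actual inventory mapping.

    # Hypothetical mapping from (language, phone) to a shared unit. Phones with
    # close articulation in English and Cantonese share one acoustic model.
    SHARED_UNITS = {
        ("en", "s"): "S", ("yue", "s"): "S",
        ("en", "m"): "M", ("yue", "m"): "M",
        ("en", "iy"): "I", ("yue", "i"): "I",
    }

    def to_shared(lang, phone):
        """Map a language-specific phone to the common space; fall back to a
        language-tagged unit so unmapped phones stay distinct."""
        return SHARED_UNITS.get((lang, phone), f"{lang}:{phone}")

    print([to_shared("en", p) for p in ["s", "iy", "th"]])  # ['S', 'I', 'en:th']
    print([to_shared("yue", p) for p in ["s", "i"]])        # ['S', 'I']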
|
#4 | An Investigation on Speech Recognition for Colloquial Arabic
Sarah Al-Shareef (The University of Sheffield), Thomas Hain (The University of Sheffield)
This paper describes a study of grapheme-based speech recognition for colloquial Arabic. An investigation of language and acoustic model configurations is carried out to illustrate the differences between colloquial and modern standard Arabic (MSA), using Levantine telephone conversations as an example. The study defines extensive and carefully crafted data sets for different dialects and studies their overlap with MSA sources. The use of grapheme models is re-investigated, and alternative acoustic model configurations that correct obvious shortcomings are tested. Recognition performance was analyzed at two levels: the corpus level and the dialect level. In addition, modifications of dictionaries that allow better specification of sound patterns are explored. Overall, the experiments highlight the need for higher-level information in acoustic model selection.
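As background on the grapheme-based approach, the sketch below derives a pronunciation directly from a word's letters, so each grapheme serves as the acoustic modeling unit and no phonetic dictionary is needed. The example words and the diacritic-stripping step are illustrative assumptions, not the authors' exact lexicon-building procedure.

    import unicodedata

    def grapheme_pronunciation(word):
        """Use each letter of the word as an acoustic unit, dropping Arabic
        diacritics (combining marks), which are usually unwritten anyway."""
        stripped = "".join(
            ch for ch in unicodedata.normalize("NFD", word)
            if not unicodedata.combining(ch)
        )
        return list(stripped)

    # Toy lexicon entries: grapheme "pronunciations" come straight from spelling.
    for word in ["كتاب", "مرحبا"]:
        print(word, "->", " ".join(grapheme_pronunciation(word)))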
|
#5 | A multithreaded implementation of Viterbi decoding on Recursive Transition Networks
Fabio Brugnara (HLT research unit, FBK - Fondazione Bruno Kessler, Trento, Italy)
This paper describes the move to a multithreaded implementation of a Recursive Transition Network Viterbi speech decoder, undertaken with the objective of performing low-latency synchronous decoding on live audio streams to support online subtitling. The approach was meant to be independent of any specific hardware, in order to be easily exploitable on common computers, and portable to different operating systems. The paper presents the reference serial algorithm, together with the modifications introduced to distribute most of the load across threads by means of a dispatcher/collector thread and several worker threads. Results are presented, confirming a performance benefit in accordance with the design goals.
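The dispatcher/worker structure described above can be sketched as follows: a dispatcher splits each frame's work into chunks, worker threads process them in parallel, and the dispatcher collects the results in order. The chunking policy and the stand-in scoring function are assumptions for illustration; the decoder's real per-state computation is of course more involved.

    import queue
    import threading

    def worker(tasks, results):
        """Process chunks of states until a None sentinel arrives."""
        while True:
            item = tasks.get()
            if item is None:
                break
            idx, chunk = item
            # Stand-in for per-state Viterbi score updates on this chunk.
            results.put((idx, [x * 2.0 for x in chunk]))

    def process_frame(scores, n_workers=4, chunk=8):
        """Dispatch one frame's state scores to workers and collect results."""
        tasks, results = queue.Queue(), queue.Queue()
        threads = [threading.Thread(target=worker, args=(tasks, results))
                   for _ in range(n_workers)]
        for t in threads:
            t.start()
        chunks = [scores[i:i + chunk] for i in range(0, len(scores), chunk)]
        for idx, c in enumerate(chunks):
            tasks.put((idx, c))
        for _ in threads:
            tasks.put(None)  # one sentinel per worker
        out = [results.get() for _ in chunks]
        for t in threads:
            t.join()
        return [x for _, c in sorted(out) for x in c]

    print(process_frame(list(range(20)))[:5])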
|
#6 | Recurrent Neural Network based Language Modeling in Meeting Recognition
Stefan Kombrink (Brno University of Technology), Tomas Mikolov (Brno University of Technology), Martin Karafiat (Brno University of Technology), Lukas Burget (Brno University of Technology)
We use recurrent neural network (RNN) based language models to improve the BUT English meeting recognizer. On the baseline setup, using the original language models, we decrease the word error rate (WER) by more than 1% absolute through n-best list rescoring and language model adaptation. When n-gram language models are trained on the same moderately sized data set as the RNN models, the improvements are larger, yielding a system that performs comparably to the baseline. A noticeable improvement was observed with unsupervised adaptation of the RNN models. Furthermore, we examine the influence of word history on WER and show how to speed up rescoring by caching common prefix strings.
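The prefix-caching speed-up mentioned last can be illustrated as below: since n-best hypotheses share long common prefixes, per-word LM scores keyed by the prefix are computed once and reused. The toy scoring function is an assumption standing in for a real RNN LM query.

    cache = {}

    def toy_lm_logprob(prefix, word):
        """Stand-in for an RNN LM call; real rescoring queries the network."""
        return -0.1 * (len(prefix) + len(word))

    def score_with_cache(words):
        """Sum per-word scores, reusing cached scores for shared prefixes."""
        total, prefix = 0.0, ()
        for w in words:
            key = prefix + (w,)
            if key not in cache:
                cache[key] = toy_lm_logprob(prefix, w)
            total += cache[key]
            prefix = key
        return total

    # Hypotheses from an n-best list share the prefix "we were" and hit the cache.
    for hyp in [["we", "were", "right"], ["we", "were", "writing"]]:
        print(" ".join(hyp), score_with_cache(hyp))
    print(len(cache))  # 4 entries, not 6: the shared prefix was scored once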
|
#7 | Ad-Hoc Meeting Transcription on Clusters of Mobile Devices
Michele Cossalter (Carnegie Mellon University), Priya Sundararajan (Carnegie Mellon University), Ian Lane (Carnegie Mellon University)
For all the time invested in meetings, very little of the wealth of information that is exchanged is preserved. In this paper, we propose a novel platform for meeting transcription that uses cellular phones for recognition. As most meeting participants carry cellular phones, this platform allows meetings to be transcribed anywhere, without requiring any additional infrastructure. Within the proposed platform, we compare three approaches for combining audio from multiple devices: microphone selection at the signal level, microphone selection at the feature level, and combination of decoder outputs via confusion network combination. We evaluated our approach on speech collected in a meeting environment and found that early microphone selection at the signal level obtained a 16% improvement in speech recognition accuracy compared to using a single recording device. Moreover, this approach offered performance comparable to multi-system confusion network combination while requiring significantly lower computational cost.
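A minimal sketch of signal-level microphone selection follows, assuming frame energy as the selection criterion (the abstract does not specify the actual criterion): the channel with the highest average energy over a segment is kept and the others are discarded before recognition.

    import numpy as np

    def select_channel(channels):
        """Pick the index of the channel with the highest mean energy."""
        energies = [float(np.mean(np.square(c))) for c in channels]
        return int(np.argmax(energies))

    # Three toy one-second "recordings" at 16 kHz with different gains.
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(16000)
    channels = [0.2 * speech, 1.0 * speech, 0.5 * speech]
    print(f"selected channel {select_channel(channels)}")  # 1, the loudest capture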
|
#8 | ROVER Enhancement with Automatic Error Detection
Kacem Abida (University of Waterloo), Fakhri Karray (University of Waterloo)
In this paper, an approach is presented to improve the performance of the Recognizer Output Voting Error Reduction (ROVER) procedure used to combine speech decoders in automatic speech transcription. A contextual analysis is injected into the ROVER process to detect and eliminate erroneous words. This filtering is carried out through the combination of automatic error detection techniques. Experiments showed that it is possible to outperform the ROVER baseline, and that combining it with error detection methods leads to an even lower word error rate (WER) in the final ROVER composite output.
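A minimal sketch of the combination idea, with strong assumptions: the hypotheses are already word-aligned into slots (real ROVER performs this alignment itself), and a stub detector stands in for the paper's error detection techniques. Flagged words are removed from each slot before majority voting.

    from collections import Counter

    def rover_with_filtering(aligned_slots, is_error):
        """Vote per slot over system outputs, dropping words flagged as errors."""
        output = []
        for slot in aligned_slots:
            kept = [w for w in slot if w is not None and not is_error(w)]
            if kept:
                word, _ = Counter(kept).most_common(1)[0]
                output.append(word)
        return output

    # Three systems' aligned outputs; None marks a deletion in the alignment.
    slots = [("the", "the", "a"), ("cat", "cap", "cat"), ("sat", None, "sat")]
    flagged = {"cap"}  # stub error detector: a fixed set of suspect words
    print(rover_with_filtering(slots, lambda w: w in flagged))
    # -> ['the', 'cat', 'sat']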
|
#9 | Automatic Comma Insertion of Lecture Transcripts Based on Multiple Annotations
Yuya Akita (Kyoto University), Tatsuya Kawahara (Kyoto University)
To enhance the readability and usability of speech recognition results, automatic punctuation is an essential process. In this paper, we address automatic comma prediction based on conditional random fields (CRF) using lexical, syntactic and pause information. Since there is large disagreement among humans on comma insertion, we model individual punctuation tendencies using annotations given by multiple annotators, and combine these models with voting and interpolation frameworks. Experimental evaluations on real lecture speech demonstrated that the combination of individual punctuation models achieves higher prediction accuracy, both for commas agreed on by all annotators and for those given by individual annotators.
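The interpolation-style combination can be sketched as below, assuming each annotator-specific model exposes a per-position comma probability (a toy stand-in here rather than a real CRF): interpolated probabilities above a threshold trigger a comma.

    def combine_comma_probs(per_model_probs, weights):
        """Interpolate comma probabilities from annotator-specific models."""
        return [sum(w * p[i] for w, p in zip(weights, per_model_probs))
                for i in range(len(per_model_probs[0]))]

    def insert_commas(tokens, per_model_probs, weights, threshold=0.5):
        combined = combine_comma_probs(per_model_probs, weights)
        return " ".join(tok + ("," if p >= threshold else "")
                        for tok, p in zip(tokens, combined))

    tokens = ["today", "we", "discuss", "punctuation"]
    # Toy per-token comma probabilities from two annotator models (stand-ins).
    model_a = [0.8, 0.1, 0.2, 0.0]
    model_b = [0.6, 0.2, 0.1, 0.0]
    print(insert_commas(tokens, [model_a, model_b], [0.5, 0.5]))
    # -> "today, we discuss punctuation"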
|