12th Annual Conference of the
International Speech Communication Association

Interspeech 2011 Florence

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself. If you have signed in to My Schedule, you can add papers to your own personalised list.

Mon-Ses1-P2: Applications for Learning, Education, Aged and Handicapped Persons

Time: Monday 10:00  Place: Valfonda 2 - Pala Congressi (Passi Perduti-Gallery)  Type: Poster
Chair: Roberto Gretter

#1 Verifying Human Users in Speech-Based Interactions

Sajad Shirali-Shahreza (University of Toronto)
Yashar Ganjali (University of Toronto)
Ravin Balakrishnan (University of Toronto)

Verifying that a live human is interacting with an automated speech-based system is needed in some applications such as biometric authentication. In this paper, we present a method to verify that the user is human. Simply stated, our method asks the user to repeat a sentence. The reply is analyzed to verify that it is the requested sentence and said by a human, not a speech synthesis system. Our method takes advantage of the limitations of both speech synthesizers and speech recognizers to detect computer programs, which is a new and potentially more accessible way to develop CAPTCHA systems. Using an acoustic model trained on voices of over 1000 users, our system can verify the user’s answer with 98% accuracy and with 80% success in distinguishing humans from computers.
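
A minimal sketch of the verification flow just described, assuming two external components that the abstract treats as building blocks: an ASR engine (recognize) and a human/synthetic voice classifier (synthetic_score). Both names and both thresholds are hypothetical, not part of the paper.

    import difflib

    def verify_user(prompt: str, audio: bytes, recognize, synthetic_score,
                    text_threshold: float = 0.8,
                    synth_threshold: float = 0.5) -> bool:
        """Accept only if the reply matches the prompt AND the audio does
        not look machine-generated (both thresholds illustrative)."""
        hypothesis = recognize(audio)  # ASR transcript (assumed component)
        match = difflib.SequenceMatcher(
            None, prompt.lower().split(), hypothesis.lower().split()).ratio()
        if match < text_threshold:     # not the requested sentence
            return False
        return synthetic_score(audio) < synth_threshold  # human-like voice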

#2 Automatic Assessment of Prosody in High-Stakes English Tests

Jian Cheng (Knowledge Technologies, Pearson)

Prosody can be used to infer whether or not candidates fully understand a passage they are reading aloud. In this paper, we focused on automatic assessment of prosody in a read-aloud section for a high-stakes English test. A new method was proposed to handle fundamental frequency (F0) of unvoiced segments that significantly improved the predictive power of F0. The k-means clustering method was used to build canonical contour models at the word level for F0 and energy. A direct comparison between the candidate’s contours and ideal contours gave a strong prediction of the candidate’s human prosody rating. Duration information at the phoneme level was an even better predictive feature. When the contours and duration information were combined, the correlation coefficient r = 0.80 was obtained, which exceeded the correlation between human raters (r = 0.75). The results support the use of the new methods for evaluating prosody in high-stakes assessments.
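
A small sketch of the canonical-contour step, under the assumption that word-level F0 contours have been resampled to a fixed length; the cluster count and the random demo data are illustrative, not the paper’s settings.

    import numpy as np
    from sklearn.cluster import KMeans

    def train_canonical_contours(contours: np.ndarray, k: int = 4) -> KMeans:
        """contours: (n_words, n_points) fixed-length F0 contours for one word."""
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit(contours)

    def contour_score(model: KMeans, contour: np.ndarray) -> float:
        """Correlation between a candidate contour and the nearest
        canonical (cluster-centre) contour."""
        centre = model.cluster_centers_[model.predict(contour[None, :])[0]]
        return float(np.corrcoef(contour, centre)[0, 1])

    rng = np.random.default_rng(1)
    demo = rng.normal(size=(40, 20))        # 40 toy contours, 20 points each
    model = train_canonical_contours(demo)
    print(contour_score(model, demo[0]))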

#3 Improvement of Segmental Mispronunciation Detection with Prior Knowledge Extracted from Large L2 Speech Corpus

Dean Luo (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences/The Chinese University of Hong Kong)
Xuesong Yang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences/The Chinese University of Hong Kong)
Lan Wang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences/The Chinese University of Hong Kong)

In this paper, we propose novel methods that utilize prior mispronunciation knowledge extracted from a large L2 speech corpus to improve segmental mispronunciation detection performance. Mispronunciation rules are categorized and the occurrence frequency of each error type is calculated from phone-level annotation of the corpora. Based on these rules and statistics of mispronunciations, we construct extended pronunciation lexicons with prior probabilities that reflect how likely each type of error is to occur, used as language models for ASR. A 2-pass confusion network based strategy, which uses posterior probability scores with optimal thresholds estimated from the L2 speech corpus, is introduced to refine phone recognition results. Experimental results show that the proposed methods can significantly improve mispronunciation detection performance.
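
A sketch of what an extended pronunciation lexicon with error priors might look like; the phone sets, counts, and the crude prior estimate are invented for illustration, not taken from the paper.

    from collections import Counter

    def build_extended_lexicon(canonical: dict[str, list[str]],
                               error_counts: dict[str, Counter]) -> dict:
        """canonical: word -> canonical phone sequence
        error_counts: word -> Counter of mispronounced variants (space-joined)."""
        lexicon = {}
        for word, phones in canonical.items():
            variants = error_counts.get(word, Counter())
            total = sum(variants.values()) + 1    # +1 share for the canonical form
            # Crude priors for illustration; a real system would estimate the
            # canonical form's share from corpus counts too.
            entries = [(phones, 1 / total)]
            for variant, count in variants.items():
                entries.append((variant.split(), count / total))
            lexicon[word] = entries               # list of (phones, prior prob)
        return lexicon

    lex = build_extended_lexicon(
        {"think": ["th", "ih", "ng", "k"]},
        {"think": Counter({"s ih ng k": 7, "f ih ng k": 2})})
    print(lex["think"])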

#4 Off-Topic Detection in Automated Speech Assessment Applications

Jian Cheng (Knowledge Technologies, Pearson)
Jianqiang Shen (Knowledge Technologies, Pearson)

Automated L2 speech assessment applications need some mechanism for validating the relevance of user responses before providing scores. In this paper, we discuss a method for off-topic detection in an automated speech assessment application: a high-stakes English test (PTE Academic). Different from traditional topic detection techniques that use characteristics of text alone, our method mainly focused on features derived from speech confidence scores. We also enhanced our off-topic detection model by incorporating other features derived from acoustic likelihood, language model likelihood, and garbage modeling. The final combination model significantly outperformed classification from any individual feature. When fixing the false rejection rate at 5% on our test set, we achieved a false acceptance rate of 9.8%, a very promising result.
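
A sketch of the feature-combination idea with toy numbers; the abstract does not specify the classifier, so the logistic-regression choice and the exact feature set here are assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # rows: [mean ASR confidence, acoustic log-lik, LM log-lik, garbage score]
    X_train = np.array([[0.92, -4.1, -3.2, 0.05],   # on-topic examples
                        [0.88, -4.5, -3.6, 0.10],
                        [0.35, -7.9, -6.8, 0.70],   # off-topic examples
                        [0.41, -8.2, -7.1, 0.65]])
    y_train = np.array([0, 0, 1, 1])                # 1 = off-topic

    clf = LogisticRegression().fit(X_train, y_train)
    print(clf.predict_proba([[0.40, -8.0, -7.0, 0.6]])[0, 1])  # P(off-topic)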

#5 Towards Context-dependent Phonetic Spelling Error Correction in Children’s Freely Composed Text for Diagnostic and Pedagogical Purposes

Sebastian Stüker (Karlsruhe Institute of Technology)
Johanna Fay (Pädagogische Hochschule Karlsruhe)
Kay Berkling (Karlsruhe Institute of Technology)

Reading and writing are core competencies of any society. In Germany, international and national comparative studies such as PISA or IGLU have shown that around 25% of German school children do not reach the minimal competence level necessary to function effectively in society by the age of 15. Automated diagnosis and spelling tutoring can play an important role in raising children’s orthographic competence. One of several necessary steps in an automatic spelling tutoring system is the automatic correction of text that was freely written by children and contains errors. Based on the common knowledge that children in the first years of school write as they speak, we propose a novel, context-sensitive spelling correction algorithm that uses phonetic similarities to achieve this step. We evaluate our approach on a test set of texts written by children and show that it outperforms Hunspell, a well-established isolated-error correction program used in text processors.
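
A sketch of the phonetic candidate-ranking idea: misspellings are matched to lexicon entries by phonetic similarity, with word-bigram context as a tie-breaker. The toy grapheme-to-phoneme table and bigram counts stand in for real German resources; nothing here is the paper’s actual data.

    from difflib import SequenceMatcher

    G2P = {"fater": "fat6", "vater": "fat6", "vetter": "fet6"}   # toy table

    def phonetic_similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, G2P.get(a, a), G2P.get(b, b)).ratio()

    def correct(word: str, prev: str, lexicon: list[str],
                bigram: dict[tuple[str, str], int]) -> str:
        return max(lexicon, key=lambda cand: (
            phonetic_similarity(word, cand),      # sounds alike first
            bigram.get((prev, cand), 0)))         # then fits the context

    print(correct("fater", "mein", ["vater", "vetter"],
                  {("mein", "vater"): 12}))       # -> "vater"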

#6 Factored Translation Models for improving a Speech into Sign Language Translation System

Verónica López-Ludeña (Grupo de Tecnología del Habla. Universidad Politécnica de Madrid.)
Rubén San-Segundo (Grupo de Tecnología del Habla. Universidad Politécnica de Madrid.)
Ricardo Cordoba (Grupo de Tecnología del Habla. Universidad Politécnica de Madrid.)
Javier Ferreiros (Grupo de Tecnología del Habla. Universidad Politécnica de Madrid.)
Juan Manuel Montero (Grupo de Tecnología del Habla. Universidad Politécnica de Madrid.)
José Manuel Pardo (Grupo de Tecnología del Habla. Universidad Politécnica de Madrid.)

This paper proposes the use of Factored Translation Models (FTMs) for improving a Speech into Sign Language Translation System. These FTMs allow syntactic-semantic information to be incorporated during the translation process, which significantly reduces the translation error rate. This paper also analyses different alternatives for dealing with non-relevant words. The speech into sign language translation system has been developed and evaluated in a specific application domain: the renewal of Identity Documents and Driver’s Licenses. The translation system uses a phrase-based translation system (Moses). The evaluation results reveal that BLEU improved from 69.11% to 73.92% and mSER was reduced from 30.56% to 24.81%.
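
For readers unfamiliar with factored models: Moses expects each token annotated with its factors in word|factor1|factor2 form. Below is a sketch of preparing such input, with lemma and POS as example factors; the abstract does not state which factors the authors actually used, and the tagger output here is invented.

    def to_factored(tokens, lemmas, tags):
        """Join parallel annotation streams into Moses factored format."""
        return " ".join(f"{w}|{l}|{t}" for w, l, t in zip(tokens, lemmas, tags))

    print(to_factored(["renovar", "el", "DNI"],
                      ["renovar", "el", "DNI"],
                      ["VERB", "DET", "NOUN"]))
    # -> "renovar|renovar|VERB el|el|DET DNI|DNI|NOUN"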

#7 Formant maps in Hungarian vowels – online data inventory for research, and education

Kálmán Abari (Institute of Psychology, University of Debrecen, Hungary)
Zsuzsanna Zsófia Rácz (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Hungary)
Gábor Olaszy (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Hungary)

This paper describes a project for creating an online system for studying the main formant movements of Hungarian vowels in spoken words, as a function of their sound environment. Together, the speech material and the formant data for the vowels provide research data for many other purposes as well. For efficient presentation of the data, and to allow multilevel comparisons among formant features, an online solution was developed. The inventory data can be regarded as a reference because of the strict conformity between the defined formant data and the formants of the spoken words. A two-step manual verification phase was performed after the completion of automatic formant tracking. The online query enables quick and widespread study of formant maps in vowels. The database is available at: “https://hungarianspeech.tmit.bme.hu/formant”. Index Terms: formant map, Hungarian vowels, live measurements, coarticulation, evaluation material.

#8 Automatic Subtitling of the Basque Parliament Plenary Sessions Videos

Germán Bordel (Department of Electricity and Electronics, University of the Basque Country, Spain)
Silvia Nieto (Department of Electricity and Electronics, University of the Basque Country, Spain)
Mikel Penagarikano (Department of Electricity and Electronics, University of the Basque Country, Spain)
Luis Javier Rodriguez-Fuentes (Department of Electricity and Electronics, University of the Basque Country, Spain)
Amparo Varona (Department of Electricity and Electronics, University of the Basque Country, Spain)

Subtitling of video content offered on the web by Spanish administration agencies is required by law, allowing people with hearing impairments to follow it. The automatic video subtitling system described in this paper has been developed to be applied to the videos that the Basque Parliament posts on its website (https://www.parlamentovasco.euskolegebiltzarra.org/), and has been running since September 2010. A specific characteristic of this system is the use of a simple phonetic decoder based on a joint selection of Basque and Spanish phone models, since it is not unusual for parliamentarians to mix the two languages. The system uses the manually transcribed Session Diaries (nearly verbatim but containing some errors) as subtitles, synchronizing text and voice by means of an acoustic decoder, a multilingual orthographic-phonetic transcriber and a very-large-symbol-sequence aligner.
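
A sketch of the synchronization step: the phonetized Session Diary and the time-stamped phones from the acoustic decoder are aligned as two symbol sequences, and matched anchors carry timestamps back onto the text. difflib stands in for the paper’s very-large-symbol-sequence aligner, and the data is a toy example.

    import difflib

    def align_timestamps(diary_phones: list[str],
                         decoded: list[tuple[str, float]]) -> dict[int, float]:
        """Return diary-phone index -> time (seconds) for matched anchors."""
        decoded_phones = [p for p, _ in decoded]
        matcher = difflib.SequenceMatcher(None, diary_phones, decoded_phones)
        anchors = {}
        for block in matcher.get_matching_blocks():
            for k in range(block.size):
                anchors[block.a + k] = decoded[block.b + k][1]
        return anchors

    print(align_timestamps(["o", "n", "a", "k"],
                           [("o", 0.10), ("n", 0.18), ("a", 0.25)]))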

#9 Generating Animated Pronunciation from Speech through Articulatory Feature Extraction

Yurie Iribe (Information and Media Center, Toyohashi University of Technology, Japan)
Silasak Manosavanh (Graduate School of Engineering, Toyohashi University of Technology, Japan)
Kouichi Katsurada (Graduate School of Engineering, Toyohashi University of Technology, Japan)
Ryoko Hayashi (Graduate School of Intercultural Studies, Kobe University, Japan)
Chunyue Zhu (School of Language and Communication, Kobe University, Japan)
Tsuneo Nitta (Graduate School of Engineering, Toyohashi University of Technology, Japan)

We automatically generate CG animations that express the pronunciation movements of speech through articulatory feature (AF) extraction, to support pronunciation learning. The proposed system uses MRI data to map AFs to the coordinate values needed to generate the animations. By using magnetic resonance imaging (MRI) data, we can observe the movements of the tongue, palate, and pharynx in detail while a person utters words. AFs and coordinate values are extracted by multi-layer neural networks (MLN). Specifically, the system displays animations of the pronunciation movements of both the learner and the teacher from their speech, in order to show in what way the learner’s pronunciation is wrong. Learners can thus understand their wrong pronunciation and the correct pronunciation method through specific animated pronunciations. Experiments comparing MRI data with the generated animations confirmed the accuracy of the articulatory features. Additionally, we verified the effectiveness of using AFs to generate the animation.
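
A sketch of the feature-to-coordinate regression on random toy data; the layer sizes and the scikit-learn MLPRegressor are stand-ins for the paper’s own MLN architecture and MRI-derived training targets.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 24))    # per-frame feature vectors (toy data)
    Y = rng.normal(size=(200, 4))     # articulator coordinates, e.g. tongue x/y

    net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500).fit(X, Y)
    print(net.predict(X[:1]).shape)   # (1, 4): coordinates driving one frame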

#10 A Tale of Two Tasks: Detecting Children’s Off-Task Speech in a Reading Tutor

Wei Chen (Language Technologies Institute, School of Computer Science, Carnegie Mellon University, USA)
Jack Mostow (Project LISTEN, School of Computer Science, Carnegie Mellon University, USA)

How can an automated tutor detect children’s off-task utterances? To answer this question, we trained SVM classifiers on a corpus of 495 children’s 36,492 computer-assisted oral reading utterances. On a test set of 651 utterances by 10 held-out readers, the classifier correctly detected 88% of off-task utterances and misclassified 17% of on-task utterances as off-task. As a test of generality, we applied the same classifier to 20 children’s 410 responses to vocabulary questions. The classifier detected 84% of off-task utterances but misclassified 57% of on-task utterances. Acoustic and lexical features helped detect off-task speech in both tasks.
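
A sketch of an off-task classifier over combined acoustic and lexical measurements; the four features and toy values are assumptions for illustration, not the paper’s feature set.

    import numpy as np
    from sklearn.svm import SVC

    # rows: [pitch variance, energy variance, OOV rate, overlap with target text]
    X = np.array([[0.2, 0.1, 0.05, 0.9],    # on-task reading
                  [0.3, 0.2, 0.10, 0.8],
                  [0.8, 0.7, 0.60, 0.1],    # off-task chatter
                  [0.7, 0.6, 0.50, 0.2]])
    y = np.array([0, 0, 1, 1])              # 1 = off-task

    clf = SVC(kernel="linear").fit(X, y)
    print(clf.predict([[0.75, 0.65, 0.55, 0.15]]))   # -> [1]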

#11 The problems encountered by Japanese EL2 with English short vowels as illustrated on the 3D Vowel Chart

Toshiko Isei-Jaakkola (Chubu University)
Takatoshi Naka (Chukyo University)
Keikichi Hirose (The University of Tokyo)

In this study we attempted to illustrate, using graphs and a three-dimensional (3D) vowel chart, to what extent Japanese university students who study English immediately after their enrolment have acquired English short vowels, and thus to clarify what their problems are when producing American English short vowels. There was a prediction that Japanese learners of English (JEL2) are weak in lip-rounding and protrusion, since there are no such articulatory movements in Japanese vowels. This was clarified by observing F2 and F3. JEL2 have problems with simultaneous lip and jaw movements in general in this case. We also found that there was a difference between female and male JEL2: as far as this experiment is concerned, female JEL2’s tongue and jaw movement (F2) is less stable than males’. Moreover, it may be confirmed that the 3D Vowel Chart is more useful for EL2 than the graphs.

#12 Automatic generation of listening comprehension learning material in European Portuguese

Thomas Pellegrini (INESC-ID)
Rui Correia (IST)
Isabel Trancoso (INESC-ID / IST)
Jorge Baptista (Universidade do Algarve)
Nuno Mamede (INESC-ID / IST)

The goal of this work is the automatic selection of materials for a listening comprehension game. We would like to select automatically transcribed sentences from recent broadcast news corpora, in order to gather material for the games with little human effort. The recognized words are used as the ground-truth solution of the exercises, so sentences with misrecognitions need to be filtered out. Our experiments confirmed the feasibility of the filter chain that automatically selects sentences, although stricter confidence thresholds may be needed. Together with the correct words, wrong candidates, namely distractors, are also needed to build the exercises. Two techniques of distractor generation are presented, based either on the confusion networks produced by the recognizer or on phonetic distances. The experiments confirmed the complementarity of both approaches.
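
A sketch of the phonetic-distance variant of distractor generation: vocabulary items close to the target in phone edit distance, but not identical to it, become the wrong answers. The toy phone dictionary stands in for a Portuguese pronunciation lexicon.

    def edit_distance(a: list[str], b: list[str]) -> int:
        """Single-row Levenshtein distance over phone sequences."""
        d = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, d[0] = d[0], i
            for j, cb in enumerate(b, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                       prev + (ca != cb))
        return d[-1]

    def distractors(target, lexicon, n=2):
        cands = [w for w in lexicon if w != target]
        cands.sort(key=lambda w: edit_distance(lexicon[w], lexicon[target]))
        return cands[:n]

    lex = {"porto": list("portu"), "ponto": list("pontu"),
           "parto": list("partu"), "lisboa": list("lizboa")}
    print(distractors("porto", lex))     # -> ['ponto', 'parto']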

#13 Candidate Generation for ASR Output Error Correction Using a Context-Dependent Syllable Cluster-Based Confusion Matrix

Chao-Hong Liu (Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan)
Chung-Hsien Wu (Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan)
David Sarwono (Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan)
Jhing-Fa Wang (Department of Electrical Engineering, National Cheng Kung University, Taiwan)

Error correction techniques have been proposed for spoken language understanding in applications such as language learning and spoken dialogue systems. These techniques comprise two consecutive stages: the generation of correction candidates and the selection of correction candidates. In this study, a Context-Dependent Syllable Cluster (CD-SC)-based confusion matrix is proposed for the generation of correction candidates. A Contextual Fitness Score, measuring the sequential relationship of a candidate to its neighbors, is proposed for corrected syllable sequence selection. Finally, an n-gram language model is used to determine the final word sequence output. Experiments show that the proposed method improved the BLEU score from 0.742 to 0.771 compared to the conventional speech recognition mechanism.
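
A sketch of the two-stage idea with toy tables: an inverted confusion matrix proposes per-syllable candidates, and a bigram score over neighbours selects the output sequence. The paper’s Contextual Fitness Score and CD-SC clustering are richer than this; syllables and probabilities below are invented.

    import itertools, math

    # Inverted confusion matrix: recognized syllable -> plausible intended ones
    CANDIDATES = {"si4": ["si4", "shi4"], "ji4": ["ji4", "qi4"]}
    # Toy bigram probabilities standing in for the n-gram language model
    BIGRAM = {("shi4", "ji4"): 0.20, ("si4", "ji4"): 0.02,
              ("shi4", "qi4"): 0.01, ("si4", "qi4"): 0.01}

    def correct_sequence(recognized):
        best, best_score = None, -math.inf
        for seq in itertools.product(*(CANDIDATES[s] for s in recognized)):
            score = sum(math.log(BIGRAM.get(pair, 1e-6))
                        for pair in zip(seq, seq[1:]))
            if score > best_score:
                best, best_score = list(seq), score
        return best

    print(correct_sequence(["si4", "ji4"]))   # -> ['shi4', 'ji4']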

#14 Semi-Supervised Tree Support Vector Machine for Online Cough Recognition

Thai Hoa Huynh (A*STAR Institute for Infocomm Research, Singapore)
Vu An Tran (A*STAR Institute for Infocomm Research, Singapore)
Huy Dat Tran (A*STAR Institute for Infocomm Research, Singapore)

Pneumonia and asthma are among the top causes of death worldwide, with 300 million people affected. In 2005 alone, 255,000 people died of asthma. Good control requires both proper medication and continual monitoring over days and nights. In this paper, we introduce a novel classifier, namely the Semi-Supervised Tree Support Vector Machine, to target the problem of cough detection and monitoring. It adaptively analyzes the distribution of the samples’ confidence metrics, automatically selects the most informative samples, and re-trains the core Tree SVM classifier accordingly. We also introduce a new way to build the Tree SVM, based on Fisher Linear Discriminant (FLD) analysis. Both are meant to improve the final system performance, and our proposed classifier demonstrates a clear improvement over the conventional method, validated on a database consisting of comprehensive body sounds recorded with a wearable contact microphone.
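
A sketch of the self-training loop the abstract suggests: train on labelled segments, pseudo-label the most confident unlabelled ones, and re-train. A flat SVC stands in for the paper’s Tree SVM, and the round count and selection fraction are illustrative.

    import numpy as np
    from sklearn.svm import SVC

    def self_train(X_lab, y_lab, X_unlab, rounds=3, top_frac=0.2):
        """Grow the labelled set with the most confident pseudo-labels."""
        X, y, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
        for _ in range(rounds):
            if len(pool) == 0:
                break
            clf = SVC(kernel="rbf").fit(X, y)
            conf = np.abs(clf.decision_function(pool))   # margin = confidence
            take = np.argsort(conf)[-max(1, int(top_frac * len(pool))):]
            X = np.vstack([X, pool[take]])
            y = np.concatenate([y, clf.predict(pool[take])])
            pool = np.delete(pool, take, axis=0)
        return SVC(kernel="rbf").fit(X, y)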