ISCA
12th Annual Conference of the International Speech Communication Association


Interspeech 2011 Florence

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself. If you have signed in to My Schedule, you can add papers to your own personalised list.

Sun-Ses2-S1-P:
Speech and Language Processing-Based Assistive Technologies and Health Applications

Time: Sunday 14:30   Place: Caravaggio (Adua 1) - Pala Affari - 1st Floor   Type: Poster
Chair: Shri Narayanan, Elmar Noeth

#1 Incorporating Speech Recognition Engine Into an Intelligent Assistive Reading System for Dyslexic Students

Theologos Athanaselis (ILSP)
Stelios Bakamidis (ILSP)
Ioannis Dologlou (ILSP)
Evmorfia N. Argyriou (Department of Mathematics, School of Applied Mathematical and Physical Sciences, National Technical University of Athens)
Antonis Symvonis (Department of Mathematics, School of Applied Mathematical and Physical Sciences, National Technical University of Athens)

In this paper we present an approach for incorporating a state-of-the-art speech recognition engine into a novel assistive reading system for Greek dyslexic students. This system is being developed in the framework of the AGENT-DYSL IST project and facilitates dyslexic children in learning to read fluently. Unlike previously presented approaches, the aim of this system is to monitor the progress and perspectives of a dyslexic user and supply personalised help. The goal of this help is to gradually increase the reading capabilities of the user and to gradually diminish the assistance provided, until the user is able to read as a non-dyslexic reader.

#2 An Investigation of Depressed Speech Detection: Features and Normalization

Nicholas Cummins (The University of New South Wales)
Julien Epps (The University of New South Wales and National ICT Australia)
Michael Breakspear (Black Dog Institute and School of Psychiatry, The University of New South Wales)
Roland Goecke (Faculty of Information Sciences and Engineering, University of Canberra, and RSCS, Australian National University)

In recent years, the problem of automatic detection of mental illness from the speech signal has gained some initial interest; however, remaining questions include how speech segments should be selected, what features provide good discrimination, and what benefits feature normalization might bring given the speaker-specific nature of mental disorders. In this paper, these questions are addressed empirically using classifier configurations employed in emotion recognition from speech, evaluated on a 47-speaker depressed/neutral read-sentences speech database. Results demonstrate that (1) detailed spectral features are well suited to the task, (2) speaker normalization provides benefits mainly for less detailed features, and (3) dynamic information appears to provide little benefit. Classification accuracy using a combination of MFCC and formant-based features approached 80% for this database.
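
The following sketch illustrates the kind of pipeline the abstract describes: utterance-level MFCC statistics, a per-speaker normalisation, and a standard discriminative classifier. The feature set, normalisation scheme, and classifier (librosa MFCCs, z-normalisation, an SVM) are assumptions for illustration, not the authors' exact configuration.

# Illustrative sketch only: MFCC statistics + per-speaker z-normalisation + SVM,
# in the spirit of the pipeline described above (not the authors' exact setup).
import numpy as np
import librosa
from sklearn.svm import SVC

def utterance_features(wav_path, sr=16000, n_mfcc=13):
    """Mean and std of MFCCs over the utterance (a coarse 'detailed spectral' feature)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def speaker_normalise(X, speaker_ids):
    """Z-normalise each feature within each speaker (one option for speaker normalisation)."""
    Xn = np.empty_like(X)
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        mu, sd = X[idx].mean(axis=0), X[idx].std(axis=0) + 1e-8
        Xn[idx] = (X[idx] - mu) / sd
    return Xn

# X: stacked utterance features, y: 0 = neutral, 1 = depressed, speakers: speaker id per row
# clf = SVC(kernel="rbf").fit(speaker_normalise(X, speakers), y)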

#3 Using Prosodic and Spectral Features in Detecting Depression in Elderly Males

Michelle Hewlett Sanchez (Speech Technology and Research Laboratory, SRI International and Stanford University)
Dimitra Vergyri (Speech Technology and Research Laboratory, SRI International)
Luciana Ferrer (Speech Technology and Research Laboratory, SRI International)
Colleen Richey (Speech Technology and Research Laboratory, SRI International)
Pablo Garcia (Robotics and Medical Systems Laboratory, SRI International)
Bruce Knoth (Robotics and Medical Systems Laboratory, SRI International)
William Jarrold (Center for Mind and Brain, University of California Davis)

As research in speech processing has matured, there has been much interest in paralinguistic speech processing problems including the speaker's mental and psychological health. In this study, we focus on speech features that can identify the speaker's emotional health, i.e., whether the speaker is depressed or not. We use prosodic speech measurements, such as pitch and energy, in addition to spectral features, such as formants and spectral tilt, and compute statistics of these features over different regions of the speech signal. These statistics are used as input features to a discriminative classifier that predicts the speaker's depression state. We find that with an N-fold leave-one-out cross-validation setup, we can achieve a prediction accuracy of 81.3%, where random guess is 50%.
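
A minimal sketch of the evaluation protocol described above: per-recording statistics of prosodic measurements are fed to a discriminative classifier under leave-one-out cross-validation. The specific features (pyin pitch, RMS energy) and the SVM are stand-ins, not the authors' system.

# Sketch of the protocol described above: per-recording statistics of prosodic
# measurements, scored with leave-one-out cross-validation. Feature choices and
# the classifier are assumptions, not the authors' exact system.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def prosodic_stats(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)   # pitch track
    energy = librosa.feature.rms(y=y)[0]                        # frame energy
    f0 = f0[~np.isnan(f0)]                                      # keep voiced frames only
    return np.array([f0.mean(), f0.std(), energy.mean(), energy.std()])

# X: one row of statistics per recording, y: depressed (1) vs. not depressed (0)
# accuracy = cross_val_score(SVC(), X, y, cv=LeaveOneOut()).mean()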

#4 Combining phonological and acoustic ASR-free features for pathological speech intelligibility assessment

Catherine Middag (Department of Electronics and Information Systems, Ghent University, Belgium)
Tobias Bocklet (Chair of Pattern Recognition, University of Erlangen-Nuremberg, Germany)
Jean-Pierre Martens (Department of Electronics and Information Systems, Ghent University, Belgium)
Elmar Nöth (Chair of Pattern Recognition, University of Erlangen-Nuremberg, Germany)

Intelligibility is widely used to measure the severity of articulatory problems in pathological speech. Recently, a number of automatic intelligibility assessment tools have been developed. Most of them use automatic speech recognizers (ASR) to compare the patient’s utterance with the target text. These methods are bound to one language and tend to be less accurate when speakers hesitate or make reading errors. To circumvent these problems, two different ASR-free methods were developed over the last few years, making use only of the acoustic or phonological properties of the utterance. In this paper, we demonstrate that these ASR-free techniques are also able to predict intelligibility in other languages. Moreover, they prove to be complementary, resulting in even better intelligibility predictions when both methods are combined.
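
A short sketch of the combination idea, assuming the two ASR-free feature streams are already computed per speaker: concatenate them and regress against listener intelligibility scores. The regressor and the evaluation are illustrative choices, not the authors' models.

# Minimal sketch of combining two ASR-free feature streams for intelligibility
# prediction. The feature extractors are placeholders; only the fusion-by-
# concatenation idea is illustrated.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict
from scipy.stats import pearsonr

def combine_streams(X_phonological, X_acoustic):
    """Feature-level fusion: concatenate the two per-speaker feature vectors."""
    return np.hstack([X_phonological, X_acoustic])

# y: perceptual intelligibility scores; X_*: per-speaker feature matrices (assumed given)
# X = combine_streams(X_phonological, X_acoustic)
# pred = cross_val_predict(SVR(), X, y, cv=5)
# print("correlation with listeners:", pearsonr(pred, y)[0])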

#5 Speech Synthesis Parameter Generation for the Assistive Silent Speech Interface MVOCA

Robin Hofe (University of Sheffield, UK)
Stephen R. Ell (University of Hull, UK)
Michael J. Fagan (University of Hull, UK)
James M. Gilbert (University of Hull, UK)
Phil D. Green (University of Sheffield, UK)
Roger K. Moore (University of Sheffield, UK)
Sergey I. Rybchenko (University of Hull, UK)

In previous publications, a silent speech interface based on permanent-magnetic articulography (PMA) has been introduced and evaluated using standard automatic speech recognition techniques. However, word recognition is a task that is computationally expensive and introduces a significant time delay between speech articulation and generation of the acoustic signal. This paper investigates a direct synthesis approach where control parameters for parametric speech synthesis are generated directly from the sensor data of the silent speech interface, without an intermediate lexical representation. Users of such a device would not be tied to the limited vocabulary of a word-based recogniser and could therefore express themselves more freely. This paper presents a feasibility study that investigates whether it is possible to infer speech synthesis parameters from PMA sensor data.
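
One plausible reading of "direct synthesis" is a frame-level regression from sensor channels to vocoder parameters, as sketched below. The mapping model (ridge regression), the data layout, and the parameter type are assumptions for illustration, not the method used in the paper.

# Illustrative frame-level mapping from PMA sensor frames to synthesis parameters.
# A simple ridge regression stands in for whatever mapping the authors used;
# the array layouts and parameter type (e.g. mel-cepstra) are assumptions.
import numpy as np
from sklearn.linear_model import Ridge

# sensor_frames: (n_frames, n_sensor_channels)  magnetic sensor data
# synth_params:  (n_frames, n_params)           time-aligned vocoder parameters
def train_direct_synthesis(sensor_frames, synth_params):
    model = Ridge(alpha=1.0)
    model.fit(sensor_frames, synth_params)     # multi-output regression
    return model

def silent_to_params(model, new_sensor_frames):
    """Predict synthesis parameters for unseen (silently articulated) sensor data."""
    return model.predict(new_sensor_frames)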

#6 Computer-Assisted Disfluency Counts for Stuttered Speech

Peter A. Heeman (Oregon Health & Science University)
Andy McMillin (Artz Center)
J. Scott Yaruss (University of Pittsburgh)

In this paper, we present computer tools to help speech-language pathologists count disfluencies, for both real-time counts and transcript-based counts. The latter tend to be more precise and show which words are involved in each disfluency. Our approach allows real-time counts to be used as the basis for transcript-based counts. We employ automatic speech recognition to generate a word transcript (for read-speech samples), automatically merge the disfluency annotations with the word transcript, and have the clinician review the parts of the audio file where a disfluency annotation was placed.
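
The merging step lends itself to a simple sketch: each time-stamped real-time disfluency mark is attached to the transcript word whose time span contains it, so the clinician can jump to that region of the audio. The data structures here are hypothetical.

# Sketch of the merging step described above: attach each time-stamped disfluency
# count to the word whose time span contains it, for later audio review.
def merge_counts(words, disfluency_times):
    """
    words: list of (word, start_sec, end_sec) from the ASR transcript
    disfluency_times: list of times (sec) at which the clinician pressed the counter
    Returns a list of (word, start, end, n_disfluencies_attached).
    """
    merged = []
    for word, start, end in words:
        hits = [t for t in disfluency_times if start <= t < end]
        merged.append((word, start, end, len(hits)))
    return merged

# Words with a non-zero count can then be flagged for review of that audio region.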

#7 Spectral Features for Automatic Blind Intelligibility Estimation of Spastic Dysarthric Speech

Richard Hummel (Queen's University, Kingston, Ontario, Canada)
Wai-Yip Chan (Queen's University, Kingston, Ontario, Canada)
Tiago Falk (Institut National de la Recherche Scientifique, Energy, Materials, and Telecommunications, Montreal, Quebec, Canada)

In this paper, we explore the use of the standard ITU-T P.563 speech quality estimation algorithm for automatic assessment of dysarthric speech intelligibility. A linear mapping consisting of three salient P.563 internal features is proposed and shown to accurately estimate spastic dysarthric speech intelligibility. Delta-energy features are further proposed in order to characterize the atypical spectral dynamics and limited vowel space observed with spastic dysarthria. Experiments using the publicly-available Universal Access database (10 speaker patients) show that when salient delta-energy and internal P.563 features are used, correlations with subjective intelligibility ratings as high as 0.98 can be attained.
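
A rough sketch of the two ingredients named above, with placeholders for the P.563 internals (which are not reproduced here): simple delta-energy statistics to capture spectral dynamics, and a linear mapping from a few scalar features to an intelligibility estimate.

# Sketch only: delta-energy statistics plus a linear mapping to intelligibility.
# The three salient P.563 internal features are not reproduced; placeholders stand in.
import numpy as np
from sklearn.linear_model import LinearRegression

def delta_energy_stats(frame_energy_db):
    """Statistics of frame-to-frame log-energy change, meant to reflect spectral dynamics."""
    delta = np.diff(frame_energy_db)
    return np.array([np.mean(np.abs(delta)), np.std(delta)])

# X: per-speaker vectors [p563_feat_1, p563_feat_2, p563_feat_3, delta_stats...]
# y: subjective intelligibility ratings
# mapping = LinearRegression().fit(X, y)   # the "linear mapping" of the abstract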

#8 Extraction of narrative recall patterns for neuropsychological assessment

Emily Prud'hommeaux (Center for Spoken Language Understanding, Oregon Health and Science University)
Brian Roark (Center for Spoken Language Understanding, Oregon Health and Science University)

Poor narrative memory is associated with a variety of neurodegenerative and developmental disorders, such as autism and Alzheimer's related dementia. Hence, narrative recall tasks are included in most standard neurological examinations. In this paper, we explore methods for automatically assessing the quality of retellings via alignment to the original narrative. Word alignments serve both to automate manual scoring and to derive other features related to narrative coherence that can be used for diagnostic classification of neurological disorders. Despite relatively high word alignment error rates, the automatic alignments provide sufficient information to achieve nearly as accurate diagnostic classification as manual scores. Furthermore, additional features that become available with alignment provide utility in classifying subject groups. While the additional features we explore here did not provide additive gains in accuracy, they point the way to the development of many potentially useful features in this domain.
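
As a toy illustration of alignment-based scoring, the sketch below aligns a retelling to the source narrative with a plain longest-common-subsequence match and reports the fraction of source words recalled; the paper's alignment model and diagnostic features are considerably richer.

# Sketch of scoring a retelling by word alignment against the source narrative,
# using a longest-common-subsequence alignment (the paper's alignment model is richer).
def align_recall(source_words, retelling_words):
    """Fraction of source words recalled under a longest-common-subsequence alignment."""
    n, m = len(source_words), len(retelling_words)
    # dp[i][j] = maximum number of matched words aligning source[:i] with retelling[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + (source_words[i - 1] == retelling_words[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], match)
    return dp[n][m] / n

# score = align_recall("the boy lost his dog".split(), "a boy lost the dog".split())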

#9 Gesture Design of Hand-to-Speech Converter derived from Speech-to-Hand Converter based on Probabilistic Integration Model

Aki Kunikoshi (The University of Tokyo)
Yu Qiao (Shenzhen Institute of Advanced Technology)
Daisuke Saito (The University of Tokyo)
Nobuaki Minematsu (The University of Tokyo)
Keikichi Hirose (The University of Tokyo)

When dysarthrics try to communicate using speech, they often have to use speech synthesizers which require them to type word symbols or sound symbols. Input by this method often makes real-time communication troublesome. In this study, we are developing a novel speech synthesizer where speech is generated through hand motions rather than symbol input. By applying statistical voice conversion techniques, a hand space was mapped to a vowel space and a converter from hand motions to vowel transitions was developed. In this paper, we discuss the expansion of this system to consonant generation. In order to create the gestures for consonants, a Speech-to-Hand conversion system is developed using parallel data for vowels. Thus, we are able to automatically search for candidates for consonant gestures for a Hand-to-Speech system.
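
The statistical voice-conversion machinery alluded to above is often realised as a joint-feature GMM regression; the sketch below shows that generic mapping applied to hand-shape features and vowel spectra. Dimensions, data, and the choice of full-covariance components are assumptions for illustration, not the authors' model.

# Generic joint-GMM regression as used in statistical voice conversion, here
# mapping a hand-shape feature vector to a vowel spectral vector (illustrative only).
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(hand_feats, vowel_feats, n_components=8):
    """Fit a GMM on joint [hand, vowel] vectors built from parallel training data."""
    Z = np.hstack([hand_feats, vowel_feats])
    gmm = GaussianMixture(n_components=n_components, covariance_type="full").fit(Z)
    return gmm, hand_feats.shape[1]

def hand_to_vowel(gmm, dx, x):
    """Conditional mean E[y | x] under the joint GMM (minimum mean-square mapping)."""
    resp = np.zeros(gmm.n_components)
    cond_means = []
    for k in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[k][:dx], gmm.means_[k][dx:]
        Sxx = gmm.covariances_[k][:dx, :dx]
        Syx = gmm.covariances_[k][dx:, :dx]
        diff = x - mu_x
        # responsibility of component k given x (marginal Gaussian in the hand subspace)
        resp[k] = gmm.weights_[k] * np.exp(-0.5 * diff @ np.linalg.solve(Sxx, diff)) \
                  / np.sqrt(np.linalg.det(2 * np.pi * Sxx))
        cond_means.append(mu_y + Syx @ np.linalg.solve(Sxx, diff))
    resp /= resp.sum()
    return sum(w * m for w, m in zip(resp, cond_means))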

#10 Powered Wheelchair Control Using Acoustic-Based Recognition of Head Gesture Accompanying Speech

Akira Sasou (National Institute of Advanced Industrial Science and Technology, AIST)

In this paper, we propose a novel interface for powered wheelchair control using acoustic-based recognition of head gestures accompanying speech. A microphone array mounted on the wheelchair localizes the position of the user’s voice. Because the localized position of the user’s voice almost coincides with that of the mouth, the head movements accompanying speech can be tracked by means of the microphone array. The proposed interface does not require disabled people to wear any microphones or utter recognizable voice commands; it requires only two capabilities: the ability to move the head and the ability to utter an arbitrary sound. In our preliminary experiments, five subjects performed six kinds of head gestures accompanying speech. The head gestures of each subject were recognized using models trained on the other subjects' data. The average recognition accuracy was 99.7%.
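
Only the localisation step is easy to sketch in a few lines: a GCC-PHAT time-delay estimate for one microphone pair, from which a direction (and hence head movement over time) can be derived. The array geometry and the gesture models trained across subjects are omitted, and this is not claimed to be the author's implementation.

# Sketch of the localisation step only: GCC-PHAT time-delay estimation for one
# microphone pair; tracking the delay over time reflects head movement.
import numpy as np

def gcc_phat_delay(sig_a, sig_b, sr, max_tau=None):
    """Estimate the time delay (seconds) of sig_b relative to sig_a."""
    n = 2 * max(len(sig_a), len(sig_b))
    SA, SB = np.fft.rfft(sig_a, n=n), np.fft.rfft(sig_b, n=n)
    R = SA * np.conj(SB)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)    # PHAT weighting
    max_shift = n // 2 if max_tau is None else int(sr * max_tau)
    cc = np.concatenate([cc[-max_shift:], cc[:max_shift + 1]])
    return (np.argmax(np.abs(cc)) - max_shift) / sr

# With mic spacing d and sound speed c, the delay maps to an angle: theta = arcsin(c * tau / d).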

#11 Analyzing training dependencies and posterior fusion in discriminant classification of apnea patients based on sustained and connected speech

Jose Luis Blanco (Universidad Politecnica de Madrid)
Ruben Fernandez (Universidad Politecnica de Madrid)
Doroteo Torre (Universidad Autonoma de Madrid)
Francisco Javier Caminero (Telefonica R&D)
Eduardo Lopez (Universidad Politecnica de Madrid)

We present a novel approach, using both sustained vowels and connected speech, to detect obstructive sleep apnoea (OSA) cases within a homogeneous group of speakers. The proposed scheme is based on state-of-the-art GMM-based classifiers and specifically acknowledges the way in which acoustic models are trained on standard databases, as well as the complexity of the resulting models and their adaptation to specific data. Our experimental database contains a suitable number of utterances and sustained speech from healthy (i.e. control) and OSA Spanish speakers. Finally, a 25.1% relative reduction in classification error is achieved when fusing continuous and sustained speech classifiers.
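
A minimal sketch of the score-level fusion described above, assuming one GMM pair (OSA vs. control) trained on connected speech and another on sustained vowels; the features, models, and fusion weight are placeholders, not the authors' configuration.

# Sketch of fusing two GMM-based classifiers at the score level.
# Each classifier contributes a log-likelihood ratio; a single weight alpha blends them.
import numpy as np
from sklearn.mixture import GaussianMixture

def llr(gmm_osa, gmm_control, feats):
    """Average per-frame log-likelihood ratio (OSA vs. control) for one recording."""
    return gmm_osa.score(feats) - gmm_control.score(feats)

def fused_decision(llr_connected, llr_sustained, alpha=0.5, threshold=0.0):
    """Weighted combination of the connected-speech and sustained-vowel scores."""
    fused = alpha * llr_connected + (1.0 - alpha) * llr_sustained
    return fused > threshold     # True -> classify as OSA

# Each gmm_* model would be trained beforehand, e.g. GaussianMixture(16).fit(training_frames).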