12th Annual Conference of the International Speech Communication Association
Interspeech 2011, Florence
Human Speech and Sound Perception I
Place: Valfonda 1 - Pala Congressi (Passi Perduti-Gallery)
#1: Parallels in infants’ attention to speech articulation and to physical changes in speech-unrelated objects
Eeva Klintfors (Dept of Linguistics, Section for Phonetics, Stockholm University)
Ellen Marklund (Dept of Linguistics, Section for Phonetics, Stockholm University)
Francisco Lacerda (Dept of Linguistics, Section for Phonetics, Stockholm University)
The mechanisms by which children develop the capacity to make use of visual cues while listening to speech are not yet exhaustively explored. The purpose of this study is to explore potential parallels between the way infants attend to speech articulation and their perception of physical changes in speech-unrelated objects. The current research questions grew out of an earlier study which found that infants’ speech perception seems to be based on a match between auditory and visual prominence, as opposed to a match between sound and face. The data suggested that speech perception in infancy may function as described by Stevens’ power law, and two methodological supplements were made to test this hypothesis: first, a non-speech test condition was added to investigate infants’ perception of speech-unrelated objects, and second, amplitude-manipulated stimuli were added to introduce systematic changes in loudness. Results showed that visually prominent stimuli were favored in both the speech and non-speech conditions.
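For reference (our formulation, not quoted from the paper), Stevens’ power law relates perceived sensation magnitude to physical stimulus intensity:

```latex
% Stevens' power law: perceived magnitude \psi as a function of
% physical stimulus intensity \phi, with a dimension-specific exponent \beta.
\psi = k\,\phi^{\beta}
```

Here k is a scaling constant and the exponent beta depends on the sensory dimension; for loudness beta is below 1, so equal physical increments yield progressively smaller perceptual increments.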
#2: Speech events are recoverable from unlabeled articulatory data: Using an unsupervised clustering approach on data obtained from Electromagnetic Midsagittal Articulography (EMA)
Daniel Duran (Institute for Natural Language Processing, University of Stuttgart, Germany)
Jagoda Bruni (Institute for Natural Language Processing, University of Stuttgart, Germany)
Grzegorz Dogil (Institute for Natural Language Processing, University of Stuttgart, Germany)
Hinrich Schütze (Institute for Natural Language Processing, University of Stuttgart, Germany)
Some models of speech perception/production and language acquisition make use of a quasi-continuous representation of the acoustic speech signal. We investigate whether such models could potentially profit from incorporating articulatory information in an analogous fashion.
In particular, we investigate how articulatory information represented by EMA measurements can influence unsupervised phonetic speech categorization. Incorporating the acoustic signal together with non-synthetic, raw articulatory data, we present first results of a clustering procedure of the kind applied in numerous language acquisition and speech perception models.
We observe that unlabeled articulatory data, i.e. data without previously assumed landmarks, yield good clustering results.
A more effective clustering outcome for plosives than for vowels seems to support the motor view of speech perception.
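A minimal sketch of the kind of unsupervised clustering described above, applied to combined acoustic and articulatory feature vectors; the dimensions, cluster count, and random data are placeholders, not the authors’ setup:

```python
# Illustrative sketch (not the authors' code): unsupervised clustering of
# frames described by combined acoustic and articulatory features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder data: 1000 frames of 13-dim acoustic features (e.g., MFCCs)
# and 12-dim articulatory features (e.g., x/y positions of six EMA coils).
acoustic = rng.normal(size=(1000, 13))
articulatory = rng.normal(size=(1000, 12))

# Concatenate the two streams and standardize so neither dominates distances.
features = StandardScaler().fit_transform(np.hstack([acoustic, articulatory]))

# Cluster frames into a hypothetical number of phonetic categories.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))  # frames per discovered category
```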
#3: Children’s recognition of their own voice: influence of phonological impairment
Sofia Strömbergsson (Department of Speech, Music and Hearing, School of Computer Science and Communication, Royal Institute of Technology (KTH), Stockholm, Sweden)
This study explores the ability to identify a recorded voice as one’s own in three groups of children: one group of children with phonological impairment (PI) and two groups of children with typical speech and language development, 4-5-year-olds and 7-8-year-olds. High average performance in all three groups suggests that these children indeed recognize their recorded voice as their own, with no significant difference between the groups. Indications that children with deviant speech use their speech deviance as a cue to identifying their own voice are discussed.
#4: Evaluation of Bone-conducted Ultrasonic Hearing-aid Regarding Transmission of Speaker Discrimination Information
Takayuki Kagomiya (Health Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan)
Seiji Nakagawa (Health Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan)
Human listeners can perceive speech signals in a voice-modulated ultrasonic carrier presented through a bone-conduction stimulator, even when the listeners are patients with sensorineural hearing loss. Based on this fact, we have been developing a bone-conducted ultrasonic hearing aid (BCUHA). The purpose of this study is to evaluate how well the BCUHA transmits speaker discrimination information. For this purpose, a prototype speaker discrimination test was developed. The test consists of 120 pairs drawn from 10 words spoken by 10 speakers, and examinees are asked to judge whether the speakers in each pair are the "same" or "different". The usability of the BCUHA was assessed with this test, which was also conducted under air-conduction (AC) and cochlear implant simulator (CIsim) conditions. The results show that the BCUHA transmits speaker information as well as the CIsim does.
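One plausible construction of such a same/different test, sketched for illustration; the abstract does not specify the actual pairing scheme, so the 80/40 split below is an assumption chosen to total 120 pairs:

```python
# Hypothetical construction of a 120-pair same/different speaker
# discrimination test; the actual pairing scheme is not given in the paper.
import itertools
import random

random.seed(0)
speakers = [f"spk{i:02d}" for i in range(10)]
words = [f"word{i:02d}" for i in range(10)]

trials = []
for w in words:
    # "different" trials: two distinct speakers saying the same word.
    for s1, s2 in random.sample(list(itertools.combinations(speakers, 2)), 8):
        trials.append(((s1, w), (s2, w), "different"))
    # "same" trials: one speaker's token of the word paired with itself.
    for s in random.sample(speakers, 4):
        trials.append(((s, w), (s, w), "same"))

random.shuffle(trials)
print(len(trials))  # 120 trials in this construction
```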
#5: Impact of Different Feedback Mechanisms in EMG-based Speech Recognition
Christian Herff (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)
Matthias Janke (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)
Michael Wand (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)
Tanja Schultz (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)
This paper reports on our recent research in the feedback effects of Silent Speech. Our technology is based on surface electromyography (EMG) which captures the electrical potentials of the human articulatory muscles rather than the acoustic speech signal.
While recognition results are good for loudly articulated speech and for experienced users speaking silently, novice users usually achieve far worse results when speaking silently. Since there is no acoustic feedback when speaking silently, we investigate different kinds of feedback modes: no additional feedback beyond natural somatosensory feedback (such as the touching of the lips), visual feedback using a mirror, and indirect acoustic feedback from speaking simultaneously with a previously recorded audio signal. In addition, we examine EMG data recorded while the subject speaks audibly and silently in a loud environment, to see whether the Lombard effect can also be observed in Silent Speech.
#6: Phonotactic constraints and the segmentation of Cantonese speech
Michael C. W. Yip (The Hong Kong Institute of Education)
Two word-spotting experiments were conducted to examine whether native Cantonese listeners are constrained by phonotactic information in the segmentation of Cantonese continuous speech. Because no legal consonant clusters occur within individual Cantonese words, this kind of phonotactic information is likely to cue native Cantonese listeners to the locations of possible word boundaries in continuous speech. The results of the two experiments confirmed this prediction. Together with other relevant studies, we argue that phonotactic constraints are one useful source of information for segmenting Cantonese continuous speech.
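The segmentation cue at issue can be sketched as follows, on the abstract’s premise that no consonant cluster is legal inside a Cantonese word: any consonant-consonant sequence must straddle a word boundary. The toy inventory and example phone string below are hypothetical:

```python
# Toy illustration of the phonotactic segmentation cue discussed above.
# The consonant inventory and the example string are hypothetical.
CONSONANTS = set("bpmfdtnlgkhwjsz")

def boundary_candidates(phones: str) -> list[int]:
    """Indices at which a C.C sequence forces a posited word boundary."""
    return [i + 1
            for i in range(len(phones) - 1)
            if phones[i] in CONSONANTS and phones[i + 1] in CONSONANTS]

print(boundary_candidates("paksin"))  # -> [3]: boundary posited between 'k' and 's'
```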
#7: Reaction time and decision difficulty in the perception of intonation
Katrin Schneider (Institute for Natural Language Processing, University of Stuttgart, Germany)
Grzegorz Dogil (Institute for Natural Language Processing, University of Stuttgart, Germany)
Bernd Möbius (Department of Computational Linguistics and Phonetics, Saarland University, Germany)
An experiment was carried out to test for categorical perception as well as possible perceptual magnet effects in the two boundary tone categories L% and H% in German, corresponding to statement and question interpretations, respectively. Additionally, reaction times (RT) were logged during all subtests to see whether they support the results. Analyses revealed that RTs increased with the difficulty of the perceptual task and decreased when the decision process was easy. Task-specific results showed that RT also correlated with the number of possible answers in a perceptual decision, i.e. more answer alternatives resulted in longer RTs. Furthermore, female subjects generally reacted faster in all perceptual tasks, although this did not necessarily correlate with the accuracy of their responses. Overall, the results confirmed the usefulness of RT in supporting the analysis and interpretation of perceptual data.
#8: Processing of stress-related acoustic cues as indexed by ERPs
Ferenc Honbolygó (Institute for Psychology, Hungarian Academy of Sciences, Budapest, Hungary)
Valéria Csépe (Institute for Psychology, Hungarian Academy of Sciences, Budapest, Hungary)
The present paper investigated event-related brain potential correlates of the processing of word-stress-related acoustic changes. We studied the processing of non-speech stimuli containing intensity and f0 changes similar to those in speech stimuli, using a passive oddball paradigm. Contrary to our previous results obtained with speech stimuli, in which a trochaic stress pattern was contrasted with an iambic one, the non-speech stimuli elicited a single MMN component. We interpret this result as showing that the processing of stress information is based on speech-specific mechanisms rather than solely on acoustic mechanisms.
#9: On the relationship between perceived accentedness, acoustic similarity, and processing difficulty in foreign-accented speech
Marijt J. Witteman (MPI for Psycholinguistics, Nijmegen, The Netherlands, International Max Planck Research School, Radboud University, Nijmegen, The Netherlands)
Andrea Weber (MPI for Psycholinguistics, Nijmegen, The Netherlands, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands)
James M. McQueen (MPI for Psycholinguistics, Nijmegen, The Netherlands, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands, Behavioural Science Instit)
Foreign-accented speech is often perceived as more difficult to understand than native speech. What causes this potential difficulty, however, remains unknown. In the present study, we compared acoustic similarity and accent ratings of American-accented Dutch with a cross-modal priming task designed to measure online speech processing. We focused on two Dutch diphthongs: ui and ij. Though both diphthongs deviated from standard Dutch to varying degrees and perceptually varied in accent strength, native Dutch listeners recognized words containing the diphthongs easily. Thus, not all foreign-accented speech hinders comprehension, and acoustic similarity and perceived accentedness are not always predictive of processing difficulties.
#10: Perception Boundary between Single and Geminate Stops in 3- and 4-mora Japanese Words
Shigeaki Amano (Faculty of Human Informatics, Aichi-Shukutoku University)
Yukari Hirata (Department of East Asian Languages and Literatures, Colgate University)
The perception boundary between single and geminate stops in 3- and 4-mora Japanese words spoken at various speaking rates was examined by regression analyses. The perception boundary was found to be well predicted by a linear function of stop-closure duration and the duration of the word or disyllable containing the single or geminate stop. We conclude, however, that disyllable duration is a better predictor than word duration, because it provides a more consistent explanation of the perception boundary regardless of word length and speaking-rate variation. The results support a relational acoustic invariance theory.
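A minimal formulation of the kind of linear boundary function described, with variable names of our choosing rather than the authors’:

```latex
% Stop-closure duration at the single/geminate perception boundary,
% modeled as a linear function of the duration of the containing disyllable.
T_{\mathrm{closure}}^{*} = a \cdot T_{\mathrm{disyllable}} + b
```

where T*_closure is the closure duration at which perception flips from single to geminate, T_disyllable is the duration of the disyllable containing the stop, and a and b are fitted regression coefficients.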
#11: Correlation Analysis of Acoustic Features with Perceptual Voice Quality Similarity for Similar Speaker Selection
Yusuke Ijima (NTT Cyber Space Laboratories, NTT Corporation)
Mitsuaki Isogai (NTT Cyber Space Laboratories, NTT Corporation)
Hideyuki Mizuno (NTT Cyber Space Laboratories, NTT Corporation)
This paper describes the correlations between various acoustic features and perceptual voice quality similarity. We focus on identifying the acoustic features that correlate with voice quality similarity. First, a large-scale perceptual experiment using the voices of 62 speakers is conducted, and perceptual similarity scores between each pair of speakers are acquired. Next, multiple linear regression analysis is carried out; it shows that five acoustic features exhibit high correlation with voice quality similarity. Finally, we perform similar-speaker selection based on multiple linear regression with these features and assess its performance by classifying speakers according to perceptual similarity. The results indicate that combining the five acoustic features to classify speakers into two classes is effective in choosing speakers with similar voice quality; it reduces the error rate by about 44% compared to using the cepstrum alone.
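A schematic sketch of the regression-based analysis described above, not NTT’s implementation; the feature values are random placeholders standing in for the five acoustic features:

```python
# Schematic sketch (not NTT's implementation): multiple linear regression
# from acoustic features to perceptual voice-similarity scores.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

n_pairs = 62 * 61 // 2                 # speaker pairs among 62 speakers
X = rng.normal(size=(n_pairs, 5))      # five placeholder acoustic features
w_true = np.array([0.5, 0.3, 0.2, 0.1, 0.1])
y = X @ w_true + rng.normal(scale=0.2, size=n_pairs)  # mock similarity scores

model = LinearRegression().fit(X, y)   # fit similarity from the features
print(model.coef_)                     # per-feature regression weights
print(model.score(X, y))               # R^2 of the fit
```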