12th Annual Conference of the International Speech Communication Association
Interspeech 2011, Florence
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself. If you have signed in to My Schedule, you can add papers to your own personalised list.
Tue-Ses2-P1: Human Speech and Sound Perception II
Time: Tuesday 13:30
Place: Valfonda 1 - Pala Congressi (Passi Perduti-Gallery)
Type: Poster
Chair: Holger Mitterer
#1 | Pointing Gestures do not Influence the Perception of Lexical Stress
Alexandra Jesse (Department of Psychology, University of Massachusetts, Amherst, U.S.A.), Holger Mitterer (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)
We investigated whether seeing a pointing gesture influences perceived lexical stress. A pitch contour continuum between the Dutch words “CAnon” (‘canon’) and “kaNON” (‘cannon’) was presented along with a pointing gesture during the first or the second syllable. Pointing gestures based on natural recordings, but not those based on Gaussian functions, influenced stress perception (Experiments 1 and 2), especially when auditory context preceded (Experiment 2). This was not replicated in Experiment 3. Natural pointing gestures failed to affect the categorization of a pitch peak timing continuum (Experiment 4). There is thus no convincing evidence that seeing a pointing gesture influences lexical stress perception.
#2 | Relationships between Phonetic Features and Speech Perception
Ian Cushing (University of Salford), Francis Li (University of Salford), Ken Worrall (Her Majesty’s Government Communications Centre), Tim Jackson (Her Majesty’s Government Communications Centre)
This paper concerns the relationships amongst acoustic phonetic features of speech signals, perceived vocal effort, and speech clarity. It presents a statistical analysis of a large set of subjective tests on an anechoic speech corpus covering five vocal efforts, namely hushed, normal, raised, loud, and shouted, with the aim of mapping objective acoustic phonetic features onto subjective ratings. Results show that listeners can differentiate vocal effort from subtle acoustic phonetic variations. There is also a correlation between clarity and vocal effort. A regression model is further established to predict vocal effort from acoustic phonetic analysis.
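For illustration only, a minimal sketch of the kind of regression step described above is given here. It fits an ordinary least-squares model mapping a few hypothetical acoustic-phonetic features (mean F0, spectral tilt, RMS level) onto an ordinal vocal-effort scale; the features, values, and coding are assumptions for the example, not details taken from the paper.

```python
import numpy as np

# Hypothetical acoustic-phonetic features per utterance:
# [mean F0 (Hz), spectral tilt (dB/octave), RMS level (dB)]
X = np.array([
    [110.0, -14.0, 55.0],   # hushed
    [120.0, -12.0, 62.0],   # normal
    [140.0, -10.0, 68.0],   # raised
    [170.0,  -8.0, 74.0],   # loud
    [230.0,  -5.0, 82.0],   # shouted
])

# Ordinal coding of the five vocal-effort categories.
y = np.array([0, 1, 2, 3, 4], dtype=float)

# Ordinary least-squares fit with an intercept term.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

def predict_effort(features):
    """Map a feature vector onto the ordinal vocal-effort scale."""
    return float(np.append(features, 1.0) @ coef)

print(predict_effort([150.0, -9.0, 70.0]))  # falls between 'raised' and 'loud'
```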
#3 | The representation of speech in a nonlinear auditory model: time-domain analysis of simulated auditory-nerve firing patterns
Guy Brown (Department of Computer Science, University of Sheffield), Tim Jurgens (Medizinische Physik, Carl-von-Ossietzky Universität Oldenburg), Ray Meddis (Department of Psychology, University of Essex), Matthew Robertson (Department of Computer Science, University of Sheffield), Nicholas Clark (Department of Psychology, University of Essex)
A nonlinear auditory model is appraised in terms of its ability to encode speech formant frequencies in the fine time structure of its output. It is demonstrated that groups of model auditory nerve (AN) fibres with similar interpeak intervals accurately encode the resonances of synthetic three-formant syllables, in close agreement with physiological data. Acoustic features are derived from the interpeak intervals and used as the input to a hidden Markov model-based automatic speech recognition system. In a digits-in-noise recognition task, interval-based features gave a better performance than features based on AN firing rate at every signal-to-noise ratio tested.
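As a toy illustration of how interpeak intervals can encode a resonance, the sketch below picks peaks from an artificial stand-in for a simulated AN fibre response driven by a 500 Hz formant and recovers the formant frequency from the median interpeak interval; the signal, threshold, and peak-picking rule are invented for the example and do not reproduce the paper's auditory model.

```python
import numpy as np

fs = 16000                      # sampling rate (Hz)
t = np.arange(0, 0.05, 1 / fs)  # 50 ms of stand-in "fibre response"

# Hypothetical stand-in for an AN fibre driven by a formant near 500 Hz
# (half-wave rectified sinusoid instead of the model's actual output).
response = np.maximum(0.0, np.sin(2 * np.pi * 500 * t))

# Crude peak picking: local maxima above a threshold.
thr = 0.9
peaks = np.where((response[1:-1] > response[:-2]) &
                 (response[1:-1] >= response[2:]) &
                 (response[1:-1] > thr))[0] + 1

# Interpeak intervals in seconds; their reciprocal estimates the formant.
intervals = np.diff(peaks) / fs
print(1.0 / np.median(intervals))   # ~500 Hz
```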
#4 | An Automatic Voice Pleasantness Classification System based on Prosodic and Acoustic Patterns of Voice Preference
Luis Pinto-Coelho (Instituto Politécnico do Porto), Daniela Braga (Microsoft, China), Miguel Sales-Dias (Microsoft Language Development Center), Carmen Garcia-Mateo (University of Vigo)
In the last few years the number of systems and devices that use voice-based interaction has grown significantly. For continued use of these systems, the interface must be reliable and pleasant in order to provide an optimal user experience. However, there are currently very few studies that evaluate how good a voice is when the application is a speech-based interface. In this paper we present a new automatic voice pleasantness classification system based on prosodic and acoustic patterns of voice preference. Our study is based on a multi-language database composed of female voices. In the objective performance evaluation the system achieved a 7.3% error rate.
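Purely as an illustration of the kind of mapping such a classification system performs, the sketch below assigns an utterance to a preference class by nearest centroid over two hypothetical prosodic features (mean F0 and speech rate); the features, data, and two-class setup are assumptions, not details from the paper.

```python
import numpy as np

# Hypothetical training data: [mean F0 (Hz), speech rate (syllables/s)]
# labelled by listener preference (1 = preferred voice, 0 = not preferred).
feats = np.array([[200.0, 4.5], [210.0, 4.8], [180.0, 6.5], [260.0, 3.5]])
labels = np.array([1, 1, 0, 0])

# Nearest-centroid classifier: one centroid per preference class.
centroids = {c: feats[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(x):
    """Return the preference class whose centroid is closest to x."""
    x = np.asarray(x, dtype=float)
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(classify([205.0, 4.6]))   # -> 1 (closer to the 'preferred' centroid)
```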
#5 | Contributions of F1 and F2 (F2’) to the perception of plosive consonants
René Carré (Laboratoire Dynamique du Langage, CNRS-Université Lyon 2, Lyon), Pierre Divenyi (Speech and Hearing Research, Veterans Affairs Northern California Health Care System, Martinez CA, USA), Willy Serniclaes (CNRS-LEAPLE, Université René Descartes, Paris), Emmanuel Ferragne (Laboratoire Dynamique du Langage, CNRS-Université Lyon 2, Lyon), Egidio Marsico (Laboratoire Dynamique du Langage, CNRS-Université Lyon 2, Lyon), Viet-Son Nguyen (Centre MICA, CNRS/UMI2954, Hanoi University of Sciences and Technology)
This study examined the contribution of F1 and F2 alone to the perception of plosive consonants in a CV context. Applying 3-Bark spectral integration, the F2 frequency was corrected for effects of proximity to either F1 or F3, i.e., it was replaced by F2’. Subjects used a two-dimensional Method of Adjustment to select the F1 and F2 consonant onset frequencies that led to a subjectively optimal percept of a predefined target CV. Results indicate that place prototypes are guided by F2 and are largely independent of F1. Nevertheless, while F2 alone is sufficient for segregating place prototypes for some consonants and vocalic contexts, it is insufficient for explaining the perception of place.
#6 | Auditory speech processing is affected by visual speech in the periphery
Jeesun Kim (MARCS Auditory Laboratories, University of Western Sydney, Australia), Chris Davis (MARCS Auditory Laboratories, University of Western Sydney, Australia)
Two experiments were conducted to determine whether visual speech presented in the visual periphery affects the perceived identity of speech sounds. Auditory speech targets (vCv syllables) were presented in noise (-8 dB) with congruent or incongruent visual speech presented in full-face or upper-half-face conditions. Participants’ eye movements were monitored to ensure that visual speech input occurred only in the periphery. In Experiment 1, participants only had to identify what they heard. The results showed that peripherally presented visual speech (full face) facilitated identification of AV-congruent stimuli compared to the upper-face control. Visual speech also reduced correct identification of the incongruent stimuli. Experiment 2 was the same as the first, except that participants additionally performed a central visual task. Again, significant effects of visual speech were found. These results show that peripheral visual speech affects speech recognition.
#7 | Visual Speech Speeds Up Auditory Identification Responses
Tim Paris (MARCS, University of Western Sydney), Jeesun Kim (MARCS, University of Western Sydney), Chris Davis (MARCS, University of Western Sydney)
Auditory speech perception is more accurate when combined with visual speech. Recent ERP studies suggest that visual speech helps 'predict' which phoneme will be heard via feedback from visual to auditory areas, with more visually salient articulations associated with greater facilitation. Two experiments tested this hypothesis with a speeded auditory identification measure. Stimuli consisted of the sounds 'apa’, 'aka' and 'ata', with matched and mismatched videos that showed the talker’s whole face or upper face (control). The percentage of matched AV videos was set at 85% in Experiment 1 and 15% in Experiment 2. Results showed that responses to matched whole-face stimuli were faster than to both upper-face and mismatched videos in both experiments. Furthermore, salient phonemes (aPa) showed a greater reduction in reaction times than ambiguous ones (aKa). The current study provides support for the proposal that visual speech speeds up processing of auditory speech.
#8 | Agglomerative Hierarchical Clustering of Emotions in Speech Based on Subjective Relative Similarity
Ryoichi Takashima (Kobe University), Tohru Nagano (IBM Research - Tokyo), Ryuki Tachibana (IBM Research - Tokyo), Masafumi Nishimura (IBM Research - Tokyo)
When humans are asked whether or not the emotions in two speech samples belong to the same category, the judgment depends on the size of the target category. Hierarchical clustering is a suitable technique for simulating such human perception of the relative similarity of emotions in speech. To better reflect subjective similarities in the clustering results, we have devised a hierarchical clustering method that uses a new type of relative similarity data based on tagging the most similar pair in sets of three samples. This type of data allowed us to create a closed-loop algorithm for feature weight learning that uses the clustering performance as the objective function. When classifying the utterances of a specific sentence in Japanese recorded at a real call center, the method reduced errors by 15.2%.
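A minimal sketch of average-linkage agglomerative clustering over a pairwise dissimilarity matrix is shown below; how such a matrix would be derived from the triplet ("most similar pair of three") annotations, and the closed-loop feature-weight learning, are not reproduced here, and the toy matrix values are invented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy pairwise dissimilarities between 5 utterances (0 = identical emotion).
# In the paper these would come from weighted acoustic features whose weights
# are tuned against the triplet annotations; the values here are invented.
D = np.array([
    [0.0, 0.2, 0.8, 0.9, 0.7],
    [0.2, 0.0, 0.7, 0.8, 0.6],
    [0.8, 0.7, 0.0, 0.3, 0.4],
    [0.9, 0.8, 0.3, 0.0, 0.2],
    [0.7, 0.6, 0.4, 0.2, 0.0],
])

# Average-linkage agglomerative clustering on the condensed distance vector.
Z = linkage(squareform(D), method="average")

# Cut the dendrogram into two clusters (e.g. broad emotion categories).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 2 2 2]
```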
#9 | Optimal Syllabic Rates and Processing Units in Perceiving Mandarin Spoken Sentences
Guangting Mai (Language Engineering Laboratory, Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong), Gang Peng (Language Engineering Laboratory, Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong)
This paper presents our investigations of syllable-related processing during human perception of Mandarin spoken sentences. Two behavioral perception experiments were conducted employing a signal synthesis method from a previous study [1]. We found (1) a clear relationship between speech intelligibility and the syllabic rate of spoken sentences, and (2) significantly higher speech intelligibility for sentences acoustically segmented at sub-syllable and syllable levels than at levels beyond one syllable. We thereby identified the optimal syllabic rates and processing units for perceiving continuous Mandarin speech, and further discussed the association between our results and possible underlying neural mechanisms in the human brain.
#10 | Cross-Lingual Speaker Discrimination Using Natural and Synthetic Speech
Mirjam Wester (Centre for Speech Technology Research, University of Edinburgh, United Kingdom), Hui Liang (Idiap Research Institute, Martigny, Switzerland)
This paper describes speaker discrimination experiments in which native English listeners were presented with natural speech stimuli in English and Mandarin, synthetic speech stimuli in English and Mandarin, or natural Mandarin speech and synthetic English speech stimuli. In each experiment, listeners were asked to judge whether the sentences in a pair were spoken by the same person or not. We found that the results of Mandarin/English speaker discrimination were very similar to those found in previous work on German/English and Finnish/English speaker discrimination. We conclude from this and previous work that listeners are able to discriminate between speakers across languages or across speech types, but the combination of these two factors makes the speaker discrimination task too difficult for listeners to perform successfully, given that the quality of cross-language speaker-adapted speech synthesis still needs to be improved.