Interspeech 2011 Florence
12th Annual Conference of the International Speech Communication Association

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself. If you have signed in to My Schedule, you can add papers to your own personalised list.

Mon-Ses1-P1: Paralinguistic Information - Classification and Detection

Time: Monday 10:00  Place: Valfonda 1 - Pala Congressi (Passi Perduti-Gallery)  Type: Poster
Chair: Julia Hirschberg

#1 On the use of multimodal cues for the prediction of degrees of involvement in spontaneous conversation

Catharine Oertel (Trinity College Dublin)
Stefan Scherer (Ulm University)
Nick Campbell (Trinity College Dublin)

Quantifying the degree of involvement of a group of participants in a conversation is a task which humans accomplish every day, but it is something that, as of yet, machines are unable to do. In this study we first investigate the correlation between visual cues (gaze and blinking rate) and involvement. We then test the suitability of prosodic cues (acoustic model) as well as gaze and blinking (visual model) for the prediction of the degree of involvement by using a support vector machine (SVM). We also test whether the fusion of the acoustic and the visual model improves the prediction. We show that we are able to predict three classes of involvement with a reduction of error rate of 0.30 (accuracy = 0.68).
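
As a rough illustration of this kind of experiment, the sketch below trains an SVM on acoustic features, visual features, and their feature-level concatenation; the feature dimensions and random data are placeholders, and concatenation is only one of several possible fusion schemes, not necessarily the one used in the paper.

```python
# Sketch: predict three degrees of involvement from acoustic and visual cues with an SVM.
# Random data stands in for real per-segment features; feature extraction is assumed done.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_segments = 300
acoustic = rng.normal(size=(n_segments, 12))   # e.g. pitch/energy statistics per segment
visual = rng.normal(size=(n_segments, 4))      # e.g. gaze proportions, blinking rate
labels = rng.integers(0, 3, size=n_segments)   # three involvement classes

for name, X in [("acoustic", acoustic),
                ("visual", visual),
                ("fusion", np.hstack([acoustic, visual]))]:
    acc = cross_val_score(SVC(kernel="rbf", C=1.0), X, labels, cv=5).mean()
    print(f"{name:8s} accuracy: {acc:.2f}")
```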

#2 Anger Recognition in Spoken Dialog Using Linguistic and Para-Linguistic Information

Narichika Nomoto (NTT Cyber Space Laboratories, NTT Corporation)
Masafumi Tamoto (NTT Cyber Space Laboratories, NTT Corporation)
Hirokazu Masataki (NTT Cyber Space Laboratories, NTT Corporation)
Osamu Yoshioka (NTT Cyber Space Laboratories, NTT Corporation)
Satoshi Takahashi (NTT Cyber Space Laboratories, NTT Corporation)

This paper proposes a method for recognizing anger in dialogs based on linguistic and para-linguistic information in speech. Anger is classified into two types: HotAnger (agitated) and ColdAnger (calm). Conventional prosodic features based on para-linguistic information can reliably recognize the former but not the latter. To recognize anger more robustly, we apply additional para-linguistic cues, termed dialog features, which arise in conversational interaction between two speakers, such as turn-taking and back-channel feedback. We also utilize linguistic features that represent conversational emotional salience. They are acquired with Pearson's chi-square test by comparing automatically transcribed texts of angry and neutral dialogs. Experiments show that the proposed feature combination improves the F-measure of ColdAnger and HotAnger by 26.9 and 16.1 points, respectively, against a baseline that uses only prosody.
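
The linguistic-feature selection could, for instance, look like the following sketch, which ranks candidate words by Pearson's chi-square statistic computed from their counts in angry versus neutral transcripts; the vocabulary and counts are invented for illustration.

```python
# Sketch: rank candidate words by Pearson's chi-square statistic from their counts
# in automatically transcribed angry vs. neutral dialogs (counts are invented).
from scipy.stats import chi2_contingency

angry_total, neutral_total = 50_000, 50_000        # total tokens per condition
word_counts = {"refund": (120, 15),                # (count in angry, count in neutral)
               "thanks": (30, 140),
               "manager": (90, 20)}

scores = {}
for word, (a, n) in word_counts.items():
    table = [[a, angry_total - a], [n, neutral_total - n]]
    chi2, p, _, _ = chi2_contingency(table)
    scores[word] = (chi2, p)

for word, (chi2, p) in sorted(scores.items(), key=lambda kv: -kv[1][0]):
    print(f"{word:10s} chi2={chi2:7.1f} p={p:.3g}")
```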

#3 Recognition of Personality Traits from Human Spoken Conversations

Alexei V. Ivanov (Department of Information Engineering and Computer Science, University of Trento, Italy)
Giuseppe Riccardi (Department of Information Engineering and Computer Science, University of Trento, Italy)
Adam J. Sporka (Czech Technical University in Prague, Czech Republic)

We are interested in understanding human personality and its manifestations in human interactions. The automatic analysis of such personality traits in natural conversation is quite complex due to the acquisition of user-profiled corpora, the annotation task, and multidimensional modeling. While this topic has been addressed extensively in experimental psychology research, speech and language scientists have only recently engaged in limited experiments. In this paper we describe an automated system for speaker-independent personality prediction in the context of human-human spoken conversations. The evaluation of such a system is carried out on the PersIA human-human spoken dialog corpus, annotated with user self-assessments of the Big-Five personality traits. The personality predictor has been trained on paralinguistic features, and its evaluation on five personality traits shows encouraging results for the conscientiousness and extroversion labels.
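
A minimal sketch of speaker-independent evaluation is shown below: one classifier per Big-Five trait, with grouped cross-validation so that no speaker appears in both training and test folds. The features, labels, and classifier choice are placeholders, not the system described in the paper.

```python
# Sketch: speaker-independent evaluation with one binary classifier per Big-Five trait.
# GroupKFold keeps each speaker's turns entirely in either the training or the test fold.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(1)
n_turns, n_speakers = 400, 40
X = rng.normal(size=(n_turns, 20))                   # paralinguistic features per turn
speaker = rng.integers(0, n_speakers, size=n_turns)  # speaker id per turn
traits = ["openness", "conscientiousness", "extroversion", "agreeableness", "neuroticism"]
y = {t: rng.integers(0, 2, size=n_turns) for t in traits}  # high/low self-assessment

cv = GroupKFold(n_splits=5)
for trait in traits:
    acc = cross_val_score(SVC(), X, y[trait], cv=cv, groups=speaker).mean()
    print(f"{trait:18s} speaker-independent accuracy: {acc:.2f}")
```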

#4 Using Multiple Databases for Training in Emotion Recognition: To Unite or to Vote?

Björn Schuller (Institute for Human-Machine Communication, Technische Universitaet Muenchen, Germany)
Zixing Zhang (Institute for Human-Machine Communication, Technische Universitaet Muenchen, Germany)
Felix Weninger (Institute for Human-Machine Communication, Technische Universitaet Muenchen, Germany)
Gerhard Rigoll (Institute for Human-Machine Communication, Technische Universitaet Muenchen, Germany)

We present an extensive study on the performance of data agglomeration and decision-level fusion for robust cross-corpus emotion recognition. We compare joint training with multiple databases and late fusion of classifiers trained on single databases, employing six frequently used corpora of natural or elicited emotion, namely ABC, AVIC, DES, eNTERFACE, SAL, and VAM, and three classifiers, i.e. SVM, Random Forests, and Naive Bayes, to best cover for singular effects. On average over classifiers and databases, data agglomeration and majority voting deliver relative improvements of unweighted accuracy by 9.0% and 4.8%, respectively, over single-database cross-corpus classification of arousal, while majority voting performs best for valence recognition.
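
The two strategies being compared can be sketched as follows, with simulated corpora standing in for the six databases: (a) agglomeration pools all training corpora into one classifier, while (b) majority voting combines the predictions of per-corpus classifiers on the unseen target corpus.

```python
# Sketch: cross-corpus emotion recognition by (a) pooling training corpora and
# (b) majority voting over classifiers trained on single corpora (corpora are simulated).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
def fake_corpus(n, shift):
    X = rng.normal(loc=shift, size=(n, 10))      # acoustic features
    y = rng.integers(0, 2, size=n)               # e.g. low vs. high arousal
    return X, y

train_corpora = [fake_corpus(200, s) for s in (0.0, 0.3, 0.6)]
X_test, y_test = fake_corpus(150, 0.9)           # unseen target corpus

# (a) Data agglomeration: one classifier on all pooled training data.
X_pool = np.vstack([X for X, _ in train_corpora])
y_pool = np.concatenate([y for _, y in train_corpora])
acc_pool = (SVC().fit(X_pool, y_pool).predict(X_test) == y_test).mean()

# (b) Majority voting over per-corpus classifiers (binary labels, three voters).
preds = np.array([SVC().fit(X, y).predict(X_test) for X, y in train_corpora])
vote = (preds.sum(axis=0) >= 2).astype(int)
acc_vote = (vote == y_test).mean()

print(f"agglomeration: {acc_pool:.2f}  majority vote: {acc_vote:.2f}")
```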

#5 “Would You Buy A Car From Me?” – On the Likability of Telephone Voices

Felix Burkhardt (Deutsche Telekom Laboratories)
Björn Schuller (Institute for Human-Machine Communication, Technische Universität München)
Benjamin Weiss (Quality & Usability Lab, Technische Universität Berlin)
Felix Weninger (Institute for Human-Machine Communication, Technische Universität München)

We researched how “likable” or “pleasant” a speaker appears, based on a subset of the “Agender” database which was recently introduced at the Interspeech 2010 Paralinguistic Challenge. 32 participants rated the stimuli according to their likability on a seven-point scale. An ANOVA showed that the rated samples are significantly different, although the inter-rater agreement is not very high. Experiments with automatic regression and classification by REPTree ensemble learning resulted in a cross-correlation of up to 0.378 with the evaluator-weighted estimator, and 67.6% accuracy in binary classification (likable / not likable). Analysis of individual acoustic feature groups reveals that for this data, auditory spectral features seem to contribute most to reliable automatic likability analysis.
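
A rough sketch of the regression setup follows, with random features standing in for the acoustic descriptors and a random-forest regressor standing in for the Weka REPTree ensemble used in the paper; performance is scored as the Pearson correlation between cross-validated predictions and the ratings.

```python
# Sketch: regress mean likability ratings from acoustic features and score the result
# by Pearson correlation between cross-validated predictions and the ratings.
# A random forest stands in for the Weka REPTree ensemble; data are simulated.
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
X = rng.normal(size=(180, 30))           # acoustic features per stimulus
ratings = rng.uniform(1, 7, size=180)    # mean ratings on a seven-point likability scale

model = RandomForestRegressor(n_estimators=50, random_state=0)
pred = cross_val_predict(model, X, ratings, cv=5)
r, _ = pearsonr(pred, ratings)
print(f"cross-correlation with ratings: {r:.3f}")
```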

#6 Automatic Identification of Salient Acoustic Instances in Couples' Behavioral Interactions using Diverse Density Support Vector Machines

James Gibson (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)
Athanasios Katsamanis (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)
Matthew Black (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)
Shrikanth Narayanan (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)

Behavioral coding focuses on deriving higher-level behavioral annotations using observational data of human interactions. Automatically identifying salient events in the observed signal data could lead to a deeper understanding of how specific events in an interaction correspond to the perceived high-level behaviors of the subjects. In this paper, we analyze a corpus of married couples' interactions, in which a number of relevant behaviors, e.g., level of acceptance, were manually coded at the session-level. We propose a multiple instance learning approach called Diverse Density Support Vector Machines, trained with acoustic features, to classify extreme cases of these behaviors, e.g., low acceptance vs. high acceptance. This method has the benefit of identifying salient behavioral events within the interactions, which is demonstrated by comparable classification performance to traditional SVMs while using only a subset of the events from the interactions for classification.
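
The sketch below shows a much simpler multiple-instance baseline, not the Diverse Density SVM itself: each session (bag) of acoustic-event feature vectors is summarized by mean and max pooling and classified with a standard SVM; the data and dimensions are invented.

```python
# Sketch: a simple multiple-instance baseline (not the Diverse Density SVM).
# Each session is a bag of acoustic-event feature vectors; the bag is summarized
# by mean- and max-pooling its instances and classified with a standard SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
def make_bag(label):
    n_events = rng.integers(20, 60)                      # events per session
    return rng.normal(loc=0.5 * label, size=(n_events, 8))

labels = rng.integers(0, 2, size=120)                    # e.g. low vs. high acceptance
bags = [make_bag(l) for l in labels]
B = np.array([np.hstack([b.mean(axis=0), b.max(axis=0)]) for b in bags])

acc = cross_val_score(SVC(), B, labels, cv=5).mean()
print(f"session-level accuracy: {acc:.2f}")
```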

#7 Predicting Speaker Changes and Listener Responses With And Without Eye-contact

Daniel Neiberg (CTT, TMH, CSC, KTH)
Joakim Gustafson (CTT, TMH, CSC, KTH)

This paper compares turn-taking, in terms of timing and prediction, in human-human conversations under conditions where the participants have eye contact versus no eye contact, as found in the HCRC Map Task corpus. By measuring between-speaker intervals, it was found that a larger proportion of speaker shifts occurred in overlap for the no eye-contact condition. For prediction we used prosodic and spectral features parametrized by time-varying, length-invariant discrete cosine coefficients. With Gaussian mixture modeling and variations of classifier fusion schemes, we explored the task of predicting whether there is an upcoming speaker change (SC) or not (HOLD) at the end of an utterance (EOU) with a pause lag of 200 ms. The label SC was further split into LRs (listener responses, e.g. back-channels) and other TURN-SHIFTs. The prediction was found to be somewhat easier for the eye-contact condition, for which the average recall rates were 60.57%, 66.35% and 62.00% for TURN-SHIFTs, LRs and SC, respectively.
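
The length-invariant parametrization can be illustrated as below, where variable-length utterance-final F0 contours are mapped to a fixed number of discrete cosine coefficients; the contours and the number of coefficients are invented, and the full time-varying scheme of the paper is not reproduced.

```python
# Sketch: map variable-length utterance-final F0 contours to fixed-length feature
# vectors with a discrete cosine transform, suitable as input to a GMM classifier.
import numpy as np
from scipy.fft import dct

def dct_parametrize(contour, n_coeff=6):
    """First n_coeff DCT coefficients of a variable-length contour."""
    c = dct(np.asarray(contour, dtype=float), type=2, norm="ortho")
    return c[:n_coeff]

rng = np.random.default_rng(5)
rising = np.linspace(180, 240, 37) + rng.normal(0, 3, 37)    # 37-frame contour (Hz)
falling = np.linspace(220, 150, 52) + rng.normal(0, 3, 52)   # 52-frame contour (Hz)
print(dct_parametrize(rising))    # both contours map to 6 coefficients
print(dct_parametrize(falling))
```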

#8 Emotion Classification Using Inter- and Intra-Subband Energy Variation

Senaka Amarakeerthi (University of Aizu, Japan)
Tin Lay Nwe (Institute for Infocomm Research, Singapore)
C De Silva Liyanage (University of Brunei, Brunei)
Michael Cohen (University of Aizu, Japan)

Speech is one of the most important signals that can be used to detect human emotions. Speech is modulated by different emotions by varying frequency- and energy-related acoustic parameters such as pitch, energy and formants. In this paper, we describe research on analyzing inter- and intra-subband energy variations to differentiate five emotions. The emotions considered are anger, fear, dislike, sadness, and neutral. We employ a Two-Layered Cascaded Subband (TLCS) filter to study the energy variations for extraction of acoustic features. Experiments were conducted on the Berlin Emotional Data Corpus (BEDC). We achieve average accuracy of 76.4% and 69.3% for speaker-dependent and -independent emotion classifications, respectively.
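
The following sketch extracts frame-wise subband energies with an ordinary Butterworth filter bank and derives simple inter- and intra-subband variation measures; the band edges, frame length, and toy signal are illustrative and do not reproduce the TLCS filter design.

```python
# Sketch: frame-wise subband log-energies from a Butterworth filter bank, plus
# simple intra-subband (over time) and inter-subband (per frame) variation measures.
import numpy as np
from scipy.signal import butter, sosfilt

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
signal = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 2500 * t)  # toy signal

bands = [(100, 500), (500, 1000), (1000, 2000), (2000, 4000)]  # Hz, illustrative
frame = 400                                                    # 25 ms at 16 kHz
energies = []
for lo, hi in bands:
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    y = sosfilt(sos, signal)
    frames = y[: len(y) // frame * frame].reshape(-1, frame)
    energies.append(np.log((frames ** 2).sum(axis=1) + 1e-10))
E = np.array(energies)            # shape: (n_bands, n_frames)

intra = E.std(axis=1)             # variation of each subband's energy over time
inter = E.std(axis=0)             # variation across subbands within each frame
print(intra.shape, inter.shape)
```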

#9 Emotion Classification of Infants’ Cries using Duration Ratios of Acoustic Segments

Kazuki Kitahara (Nagasaki University)
Shinzi Michiwaki (Nagasaki University)
Miku Sato (Nagasaki University)
Shoichi Matsunaga (Nagasaki University)
Masaru Yamashita (Nagasaki University)
Kazuyuki Shinohara (Nagasaki University)

We propose an approach to the classification of emotion clusters using prosodic features. In our approach, we use the duration ratios of specific acoustic segments (resonant-cry and silence segments) in the infants’ cries as prosodic features. We use power and pitch information to detect these segment periods and use a normal distribution as a prosodic model to approximate the occurrence probability of the duration ratios of these segments. Classification experiments on two major emotion clusters are carried out. When the detection performance for the segment periods is about 75%, an emotion classification rate of 70.8% is achieved. The classification performance of our approach using the duration ratios was significantly better than that of the method using power and spectral features, thereby indicating the effectiveness of using prosodic features. Furthermore, we describe a classification method using both spectral and prosodic features with a slightly better performance (71.9%).
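
A minimal sketch of the prosodic model: fit a normal distribution per emotion cluster to the duration ratios of resonant-cry and silence segments, then classify a new cry by log-likelihood. The ratios below are invented, and independent per-feature Gaussians are an assumption of the sketch.

```python
# Sketch: per-cluster normal distributions over duration ratios of resonant-cry and
# silence segments; a new cry is assigned to the cluster with the higher log-likelihood.
import numpy as np
from scipy.stats import norm

train = {                                  # invented (resonant-cry ratio, silence ratio)
    "cluster_A": np.array([[0.55, 0.20], [0.60, 0.15], [0.50, 0.25]]),
    "cluster_B": np.array([[0.30, 0.45], [0.35, 0.40], [0.25, 0.50]]),
}
models = {c: (x.mean(axis=0), x.std(axis=0) + 1e-6) for c, x in train.items()}

def classify(ratios):
    scores = {c: norm.logpdf(ratios, loc=mu, scale=sd).sum()
              for c, (mu, sd) in models.items()}
    return max(scores, key=scores.get)

print(classify(np.array([0.58, 0.18])))    # -> cluster_A
print(classify(np.array([0.28, 0.47])))    # -> cluster_B
```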

#10 Vowels formants analysis allows straightforward detection of high arousal acted and spontaneous emotions

Bogdan Vlasenko (Cognitive Systems, IESK, OvGU)
Dmytro Prylipko (Cognitive Systems, IESK, OvGU)
David Philippou-Hübner (Cognitive Systems, IESK, OvGU)
Andreas Wendemuth (Cognitive Systems, IESK, OvGU)

The role of automatic emotion recognition from speech grows continually because of the accepted importance of reacting to the emotional state of the user in human-computer interaction. Most state-of-the-art emotion recognition methods are based on context-independent turn- and frame-level analysis. In our earlier ICME 2011 article it was shown that robust detection of high-arousal acted emotions can be performed on a context-dependent vowel basis. In contrast to using HMM/GMM classification with 39-dimensional MFCC vectors, a much more convenient Neyman-Pearson criterion using only the average F1 value is employed here. In this paper we apply the proposed method to spontaneous emotion recognition from speech. Also, we avoid the use of speaker-dependent acoustic features in favor of gender-specific ones. Finally, we compare the performance on acted and spontaneous emotions for different criterion threshold values.
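
Interpreting "average F1" as the mean first formant of vowel segments, a threshold-based detector in the Neyman-Pearson spirit could be sketched as follows; the formant values are simulated, and the threshold sweep simply illustrates the detection/false-alarm trade-off rather than the authors' procedure.

```python
# Sketch: threshold detector on the average first formant (F1) of vowel segments,
# sweeping the threshold to trade detection rate against false-alarm rate
# (Neyman-Pearson view). Formant values are simulated, not measured.
import numpy as np

rng = np.random.default_rng(7)
f1_neutral = rng.normal(500, 60, size=300)        # mean vowel F1 per utterance (Hz)
f1_high_arousal = rng.normal(620, 70, size=300)   # raised F1 assumed for high arousal

for thr in (520, 560, 600):
    detection = (f1_high_arousal > thr).mean()
    false_alarm = (f1_neutral > thr).mean()
    print(f"threshold {thr} Hz: detection {detection:.2f}, false alarm {false_alarm:.2f}")
```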

#11 Intra-, Inter-, and Cross-cultural Classification of Vocal Affect

Daniel Neiberg (Department of Speech, Music and Hearing (TMH), KTH, Stockholm, Sweden)
Petri Laukka (Department of Psychology, Stockholm University, Stockholm, Sweden)
Hillary Anger Elfenbein (Olin Business School, Washington University in St. Louis, St. Louis, MO, USA)

We present intra-, inter- and cross-cultural classifications of vocal expressions. Stimuli were selected from the VENEC corpus and consisted of portrayals of 11 emotions, each expressed with 3 levels of intensity. Classification (nu-SVM) was based on acoustic measures related to pitch, intensity, formants, voice source and duration. Results showed that mean recall across emotions was around 2.4-3 times higher than chance level for both intra- and inter-cultural conditions. For cross-cultural conditions, the relative performance dropped 26%, 32%, and 34% for high, medium, and low emotion intensity, respectively. This suggests that intra-cultural models were more sensitive to mismatched conditions for low emotion intensity. Preliminary results further indicated that recall rate varied as a function of emotion, with lust and sadness showing the smallest performance drops in the cross-cultural condition.
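
An intra- versus cross-cultural comparison with a nu-SVM might be sketched as below; the acoustic measures, emotion categories, and the two "cultures" are simulated rather than taken from the VENEC corpus.

```python
# Sketch: intra- vs. cross-cultural emotion classification with a nu-SVM.
# Features and the two "cultures" are simulated rather than taken from VENEC.
import numpy as np
from sklearn.svm import NuSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
def culture(shift, n=200):
    X = rng.normal(loc=shift, size=(n, 16))   # pitch, intensity, formant, duration measures
    y = rng.integers(0, 4, size=n)            # a few emotion categories
    return X, y

X_a, y_a = culture(0.0)                       # "culture A"
X_b, y_b = culture(0.4)                       # "culture B" with shifted feature statistics

intra = cross_val_score(NuSVC(nu=0.3), X_a, y_a, cv=5).mean()
cross = (NuSVC(nu=0.3).fit(X_a, y_a).predict(X_b) == y_b).mean()
print(f"intra-cultural: {intra:.2f}  cross-cultural: {cross:.2f}")
```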