Corporate & Society Sponsors
Loquendo diamond package
Nuance gold package
AT&T bronze package
Google silver package
Appen bronze package
Interactive Media bronze package
Microsoft bronze package
SpeechOcean bronze package
Avios logo package
NDI logo package

CNR-ISTC
Université d'Avignon
Speech Cycle
AT&T
Università di Firenze
FUB
FBK
Univ. Trento
Univ. Napoli
Univ. Tuscia
Univ. Calabria
Univ. Venezia
AISV
Comune di Firenze
Firenze Fiera
Florence Convention Bureau

ISCA

12th Annual Conference of the
International Speech Communication Association


Interspeech 2011 Florence

Technical Programme

This is the final programme for this session. For oral sessions, the times on the left reflect the current presentation order, which may still change, so please check at the conference itself. If you have signed in to My Schedule, you can add papers to your own personalised list.

Mon-Ses1-O4:
Speech Synthesis - Unit Selection and Hybrid Approaches

Time: Monday 10:00   Place: Michelangelo - Pala Affari - 2nd Floor   Type: Oral
Chair: Junichi Yamagishi

10:00 Enriching text-to-speech synthesis using automatic dialog act tags

Vivek Kumar Rangarajan Sridhar (AT&T Labs - Research)
Alistair Conkie (AT&T Labs - Research)
Ann Syrdal (AT&T Labs - Research)
Srinivas Bangalore (AT&T Labs - Research)

We present an approach for enriching dialog based text-to-speech (TTS) synthesis systems by explicitly controlling the expressiveness through the use of dialog act tags. The dialog act tags in our framework are automatically obtained by training a maximum entropy classifier on the Switchboard-DAMSL data set, unrelated to the TTS database. We compare the voice quality produced by exploiting automatic dialog act tags with that using human annotations of dialog acts, and with two forms of reference databases. Even though the inventory of tags is different for the automatic tagger and human annotation, exploiting either form of dialog markup generates better voice quality in comparison with the reference voices in subjective evaluation.
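For readers who want a concrete picture of the tagging step, the sketch below trains a maximum-entropy classifier (multinomial logistic regression over simple n-gram features, via scikit-learn) to assign dialog act tags to utterances. It is an illustration only, not the authors' implementation: the toy data, tag inventory, and feature set are assumptions standing in for Switchboard-DAMSL-style training material.

```python
# Minimal sketch of a maximum-entropy dialog act tagger (illustrative only;
# corpus loading, tag set, and features are simplified assumptions).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: (utterance text, dialog act tag) pairs,
# standing in for a Switchboard-DAMSL-style annotated corpus.
train_utterances = ["uh-huh", "what time does it start", "it starts at noon", "okay thanks"]
train_tags = ["backchannel", "wh-question", "statement", "thanking"]

# A maximum-entropy model is multinomial logistic regression over sparse features.
tagger = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # word and bigram counts as features
    LogisticRegression(max_iter=1000),     # MaxEnt classifier
)
tagger.fit(train_utterances, train_tags)

# At synthesis time, each input sentence receives an automatic dialog act tag,
# which can then steer the choice of expressive TTS units.
print(tagger.predict(["thanks a lot"]))
```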

10:20 Joint Target and Join Cost Weight Training for Unit Selection Synthesis

Lukas Latacz (Vrije Universiteit Brussel)
Wesley Mattheyses (Vrije Universiteit Brussel)
Werner Verhelst (Vrije Universiteit Brussel)

One of the key challenges in optimizing a unit selection voice is obtaining suitable target and join cost weights. In this paper we investigate several strategies to train these weights automatically. Two training algorithms are tested, both based on an acoustic distance that approximates human perception: a modified version of the well-known linear regression training and an iterative algorithm that tries to minimize a selection error. Since a single, global set of weights might not always result in selecting the best sequence of units, we investigate whether using multiple weight sets could improve the synthesis quality.
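The linear-regression flavour of weight training mentioned above can be pictured as a least-squares fit: find cost-component weights whose weighted sum best reproduces a perceptually motivated acoustic distance measured on candidate units. The sketch below is a minimal illustration under that assumption, with made-up cost components and distances; it is not the authors' exact algorithm.

```python
# Sketch: fit target/join cost weights so that the weighted sum of cost
# components approximates a perceptual acoustic distance (least squares).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one row per candidate unit pair, one column per cost
# component (e.g. pitch mismatch, duration mismatch, spectral join cost, ...).
cost_components = rng.random((500, 4))

# Hypothetical "ground truth": an acoustic distance that approximates human
# perception, measured on the same candidate pairs (synthetic here).
acoustic_distance = cost_components @ np.array([0.5, 1.5, 0.2, 0.8]) \
                    + 0.05 * rng.standard_normal(500)

# Least-squares estimate of the weights; a single global set like this is what
# a one-weight-set system would then use for every selection.
weights, *_ = np.linalg.lstsq(cost_components, acoustic_distance, rcond=None)
print("estimated cost weights:", weights)
```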

10:40 Prominence-Based Prosody Prediction for Unit Selection Speech Synthesis

Andreas Windmann (Faculty of Linguistics and Literature, Bielefeld University, Germany)
Igor Jauk (Faculty of Technology, Bielefeld University, Germany)
Fabio Tamburini (Department of Linguistics and Oriental Studies, University of Bologna, Italy)
Petra Wagner (Faculty of Linguistics and Literature, Bielefeld University, Germany)

This paper describes the development and evaluation of a prosody prediction module for unit selection speech synthesis that is based on the notion of perceptual prominence. We outline the design principles of the module and describe its implementation in the Bonn Open Synthesis System (BOSS). Moreover, we report results of perception experiments that have been conducted in order to evaluate prominence prediction. The paper is concluded by a general discussion of the approach and a sketch of perspectives for further work.

11:00 Evaluating the meaning of synthesized listener vocalizations

Sathish Pammi (DFKI GmbH)
Marc Schröder (DFKI GmbH)

Spoken and multimodal dialogue systems are starting to use listener vocalizations for more natural interaction. In a unit selection framework that uses a finite set of recorded listener vocalizations, synthesis quality is high but the acoustic variability is limited. As a result, many combinations of segmental form and intended meaning cannot be synthesized. This paper presents an algorithm in the unit selection domain for increasing the range of vocalizations that can be synthesized with a given set of recordings. We investigate whether the approach makes the synthesized vocalizations convey a meaning closer to the intended meaning, using a pairwise comparison perception test. The results partially confirm the hypothesis, indicating that in many cases the algorithm makes available more appropriate alternatives than the original set of recorded listener vocalizations.
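To make the selection idea concrete, the sketch below chooses, from a hypothetical inventory of recorded listener vocalizations, the one whose meaning annotation lies closest to an intended meaning vector. The meaning dimensions, inventory, and distance measure are illustrative assumptions, not the representation used in the paper.

```python
# Illustrative sketch: choose the recorded listener vocalization whose meaning
# annotation best matches an intended meaning (dimensions are hypothetical).
import numpy as np

# Hypothetical inventory: segmental form plus a meaning vector
# over (agreement, interest, amusement).
inventory = {
    "mm-hm_rising": np.array([0.8, 0.6, 0.1]),
    "yeah_flat":    np.array([0.9, 0.3, 0.0]),
    "oh_surprised": np.array([0.2, 0.9, 0.4]),
    "hehe_laugh":   np.array([0.3, 0.5, 0.9]),
}

def select_vocalization(intended_meaning):
    """Return the unit whose meaning vector is closest (Euclidean) to the target."""
    return min(inventory, key=lambda unit: np.linalg.norm(inventory[unit] - intended_meaning))

# Intended meaning: strong interest, mild agreement, no amusement.
print(select_vocalization(np.array([0.4, 0.9, 0.1])))
```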

11:20 A Hybrid TTS Approach for Prosody and Acoustic Modules

Iñaki Sainz (Aholab Signal Processing Laboratory, University of the Basque Country)
Daniel Erro (Aholab Signal Processing Laboratory, University of the Basque Country)
Eva Navas (Aholab Signal Processing Laboratory, University of the Basque Country)
Inma Hernáez (Aholab Signal Processing Laboratory, University of the Basque Country)

Unit selection (US) TTS systems generate quite natural speech, but its quality is highly variable. Statistical parametric (SP) systems offer far more consistent quality, but reduced naturalness due to their vocoding nature. We present a hybrid approach (HA) that tries to improve the overall naturalness by combining both synthesis methods. Contrary to other works, the fusion of methods is performed in both the prosody and acoustic modules, yielding more robust prosody prediction and greater naturalness. Objective and subjective experiments show the validity of our procedure.
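As a rough illustration of fusion at the prosody level (a deliberate simplification, not the Aholab system itself), one can blend the f0 contour implied by the selected natural units with the contour predicted by a statistical model, as in the sketch below; the contours and weight are made up.

```python
# Sketch of prosody-level fusion: blend a unit-selection f0 contour with a
# statistical-parametric prediction (weights and contours are hypothetical).
import numpy as np

def fuse_f0(us_f0, sp_f0, sp_weight=0.5):
    """Weighted combination of two f0 contours of equal length (values in Hz)."""
    us_f0, sp_f0 = np.asarray(us_f0, float), np.asarray(sp_f0, float)
    return (1.0 - sp_weight) * us_f0 + sp_weight * sp_f0

# Hypothetical contours for one phrase (one value per frame).
us_contour = np.array([210, 215, 220, 200, 180, 170])   # from selected natural units
sp_contour = np.array([205, 208, 212, 198, 185, 175])   # from the statistical model

print(fuse_f0(us_contour, sp_contour, sp_weight=0.4))
```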

11:40 Uniform Speech Parameterization for Multi-form Segment Synthesis

Alexander Sorin (Speech Technologies, IBM Haifa Research Lab, Haifa, Israel)
Slava Shechtman (Speech Technologies, IBM Haifa Research Lab, Haifa, Israel)
Vincent Pollet (Text-To-Speech Research, Nuance Communications, Merelbeke, Belgium)

In multi-form segment synthesis, speech is constructed by sequencing speech segments of different nature: model segments, i.e. mathematical abstractions of speech, and template segments, i.e. speech waveform fragments. These multi-form segments can have shared, layered or alternate speech parameterization schemes. This paper introduces an advanced uniform speech parameterization scheme for the statistical model segments and waveform segments employed in our multi-form segment synthesis system. Mel-Regularized Cepstrum derived from amplitude and phase spectra forms its basic framework. Furthermore, a new adaptive enhancement technique for model segments is presented that reduces the perceived gap in quality and similarity between model and template segments.
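The paper's Mel-Regularized Cepstrum is not reproduced here, but the following sketch conveys the general flavour of a uniform cepstral parameterization that could be applied identically to model and template segments: resample the log amplitude spectrum of a frame on a mel-warped frequency axis and take a DCT. Frame size, sampling rate, and coefficient count are assumptions, and phase information is ignored.

```python
# Rough sketch of a uniform cepstral parameterization of a segment's amplitude
# spectrum: generic mel-warped cepstrum, NOT the paper's Mel-Regularized Cepstrum.
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_cepstrum(amplitude_spectrum, sample_rate=16000, n_coeffs=25):
    """Log amplitude spectrum resampled on a mel axis, then DCT-II -> cepstrum."""
    n_bins = len(amplitude_spectrum)
    lin_hz = np.linspace(0.0, sample_rate / 2, n_bins)
    # Frequencies uniformly spaced on the mel scale, mapped back to Hz.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_bins)
    warp_hz = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    warped = np.interp(warp_hz, lin_hz, amplitude_spectrum)
    log_spec = np.log(np.maximum(warped, 1e-10))
    return dct(log_spec, type=2, norm="ortho")[:n_coeffs]

# Hypothetical analysis frame: a windowed 200 Hz sinusoid at 16 kHz.
frame = np.hanning(1024) * np.sin(2 * np.pi * 200 * np.arange(1024) / 16000)
frame_spectrum = np.abs(np.fft.rfft(frame))
print(mel_cepstrum(frame_spectrum)[:5])
```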