Interspeech 2011, Florence
12th Annual Conference of the International Speech Communication Association
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Sun-Ses2-P4: Spoken Dialogue & Spoken Language Understanding Systems
Time: Sunday 13:30
Place: Faenza 2 - Pala Congressi (Passi Perduti-Gallery)
Type: Poster
Chair: Steve Renals
#1 | Multi-view approach for speaker turn role labeling in TV Broadcast News shows
Geraldine Damnati (France Telecom - Orange Labs), Delphine Charlet (France Telecom - Orange Labs)
This paper addresses speaker role recognition in TV Broadcast News shows. Speaker turns are assigned a role among anchor, reporter and other. A multi-view approach is proposed that exploits the complementarity of lexical cues obtained from Automatic Speech Recognition output and acoustic cues obtained from speech signal analysis. Early and late fusion are compared. A classification accuracy of 90.1% is obtained on automatically segmented speaker turns for a 6.5-hour test corpus of 14 shows mixing news and conversational speech. Further analyses are provided for the remaining speaker turns, showing interesting perspectives towards finer-grained speaker role characterization.
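For readers unfamiliar with the two fusion strategies compared here, the following minimal sketch contrasts early fusion (concatenating the lexical and acoustic feature vectors before a single classifier) with late fusion (averaging the posteriors of two modality-specific classifiers). The random features, logistic-regression classifiers and equal posterior weights are illustrative assumptions, not the authors' actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: lexical features (e.g. word n-gram counts) and acoustic features
# (e.g. pitch/energy statistics) for a set of speaker turns; labels are roles.
rng = np.random.default_rng(0)
X_lex = rng.random((200, 50))      # placeholder lexical features
X_aco = rng.random((200, 12))      # placeholder acoustic features
y = rng.integers(0, 3, 200)        # 0 = anchor, 1 = reporter, 2 = other

# Early fusion: concatenate the two views and train one classifier.
early = LogisticRegression(max_iter=1000).fit(np.hstack([X_lex, X_aco]), y)

# Late fusion: train one classifier per view, then average their posteriors.
clf_lex = LogisticRegression(max_iter=1000).fit(X_lex, y)
clf_aco = LogisticRegression(max_iter=1000).fit(X_aco, y)
post = 0.5 * clf_lex.predict_proba(X_lex) + 0.5 * clf_aco.predict_proba(X_aco)
late_pred = post.argmax(axis=1)
```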
|
#2 | Evaluation of an Integrated Authoring Tool for Building Advanced Question-Answering Characters
Sudeep Gandhe (USC Institute for Creative Technologies), Michael Rushforth (University of Texas at San Antonio), Priti Aggarwal (USC Institute for Creative Technologies), David Traum (USC Institute for Creative Technologies)
We present the evaluation of an integrated authoring tool for rapid prototyping of dialogue systems. These dialogue systems are designed to support virtual humans engaging in advanced question-answering dialogues, such as for training tactical questioning skills. The tool was designed to help non-experts, who may have little or no knowledge of linguistics or computer science, build virtual characters that can play the role of an interviewee. The tool has been successfully used by several different non-experts to create a number of virtual characters for both training and human subjects testing. We report on experiences with seven such characters, whose development time was as little as two weeks including concept development and a round of user testing.
|
#3 | Towards Unsupervised Spoken Language Understanding: Exploiting Query Click Logs for Slot Filling
Gokhan Tur (Microsoft Speech Labs | Microsoft Research), Dilek Hakkani-Tür (Microsoft Speech Labs | Microsoft Research), Dustin Hillard (Microsoft Speech Labs), Asli Celikyilmaz (Microsoft Speech Labs)
In this paper, we present a novel approach to exploit user queries mined from search engine query click logs to bootstrap or improve slot filling models for spoken language understanding. We propose extending the earlier gazetteer population techniques to mine unannotated training data for semantic parsing. The automatically annotated mined data can then be used to train slot-specific parsing models. We show that this method can be used to bootstrap slot filling models and can be combined with any available annotated data to improve performance. Furthermore, this approach may eliminate the need for populating and maintaining in-domain gazetteers, in addition to providing complementary information if they are already available.
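As a rough illustration of this bootstrapping idea, the sketch below auto-annotates mined queries by matching them against the entities identified by their clicked URLs. The toy click log, the URL pattern and the single city slot are invented for illustration and do not reflect the authors' actual pipeline.

```python
# Toy click log: (query, clicked URL) pairs mined from a search engine.
click_log = [
    ("cheap flights to boston", "http://example.com/city/boston"),
    ("hotels near seattle airport", "http://example.com/city/seattle"),
]

# Assumed: clicked URLs of the form .../city/<name> identify a city entity.
def auto_annotate(query, url):
    """Produce IOB slot labels for the query by matching the clicked entity."""
    entity = url.rsplit("/", 1)[-1]
    tokens = query.split()
    labels = ["O"] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok == entity:
            labels[i] = "B-city"
    return list(zip(tokens, labels))

training_data = [auto_annotate(q, u) for q, u in click_log]
# training_data can now be fed to any sequence labeller (e.g. a CRF),
# together with whatever manually annotated data is available.
```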
|
#4 | Web-enhanced Contents Retrieval for Information Access Dialogue System
Donghyeon Lee (Department of Computer Science and Engineering, POSTECH, South Korea), Cheongjae Lee (Academic Center for Computing and Media Studies, Kyoto University, Japan), Minwoo Jeong (Department of Computer Science and Engineering, POSTECH, South Korea), Kyungduk Kim (Department of Computer Science and Engineering, POSTECH, South Korea), Seokhwan Kim (Department of Computer Science and Engineering, POSTECH, South Korea), Junhwi Choi (Department of Computer Science and Engineering, POSTECH, South Korea), Gary Geunbae Lee (Department of Computer Science and Engineering, POSTECH, South Korea)
We consider the problem of content retrieval with complex queries for an information access dialogue system. To handle complex queries, dialogue systems have typically relied on deep semantic processing such as full semantic parsing and ontology-based reasoning. However, these require a large amount of semantic annotation and domain expert knowledge that are often very expensive to create, and thus they have been limited in practice. In this paper, we present a simple alternative method that enhances vector space model-based content retrieval with a web search engine. For robust content retrieval, our model expands the vector space with web documents to capture underlying co-occurrence patterns between the query and contents. One merit of the proposed approach is that it does not require heavy semantic processing, and therefore it results in efficient content retrieval. We demonstrate that our method is beneficial in an electronic program guide dialogue system.
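The following sketch illustrates the general idea of expanding each content item's vector-space representation with web text before matching it against a query. TF-IDF weighting, cosine similarity and the pre-fetched web snippets are assumptions made for illustration, not the authors' exact model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each EPG content item is described by its own metadata plus (assumed)
# pre-fetched web snippets about it, which supply extra co-occurrence context.
contents = {
    "item1": "evening news politics",
    "item2": "comedy show stand-up",
}
web_snippets = {
    "item1": "latest headlines election coverage anchor",
    "item2": "jokes humour late night entertainment",
}
expanded = [contents[k] + " " + web_snippets[k] for k in contents]

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(expanded)

query = "who covers the election tonight"
scores = cosine_similarity(vec.transform([query]), doc_matrix)[0]
best = list(contents)[scores.argmax()]   # retrieved content item
```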
|
#5 | Uncertainty management for on-line optimisation of a POMDP-based large-scale spoken dialogue system
Lucie Daubigney (Supelec), Milica Gasic (Cambridge University), Senthilkumar Chandramohan (Supelec - UAPV), Matthieu Geist (Supelec), Olivier Pietquin (Supelec - UMI 2958 (CNRS GeorgiaTech)), Steve Young (Cambridge University)
The optimization of dialogue policies using reinforcement learning (RL) is now an accepted part of the state of the art in spoken dialogue systems (SDS). Yet, it is still the case that the commonly used training algorithms for SDS require a large number of dialogues, and hence most systems still rely on artificial data generated by a user simulator. Optimization is therefore performed off-line before releasing the system to real users. Gaussian Processes (GP) for RL have recently been applied to dialogue systems. One advantage of GPs is that they compute an explicit measure of uncertainty in the value function estimates computed during learning. In this paper, a class of novel learning strategies is described which use this uncertainty to control exploration on-line. Comparisons between several exploration schemes show that significant improvements to learning speed can be obtained and that rapid and safe on-line optimisation is possible, even on a complex task.
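One simple way to use such uncertainty estimates for on-line exploration is an upper-confidence-bound rule: in each belief state, pick the action whose GP posterior mean plus a multiple of its standard deviation is largest, so that poorly explored actions are tried while confidently bad ones are avoided. The sketch below shows this rule only; the GP interface and the weight beta are assumptions, and the paper's actual learning strategies may differ.

```python
import numpy as np

def select_action(gp_mean, gp_std, actions, belief_state, beta=1.5):
    """Choose the action with the largest mean + beta * std Q-value estimate.

    gp_mean / gp_std are assumed callables returning the GP posterior mean and
    standard deviation of Q(belief_state, action); beta trades off exploration
    (large beta) against exploitation (beta close to 0).
    """
    scores = [gp_mean(belief_state, a) + beta * gp_std(belief_state, a)
              for a in actions]
    return actions[int(np.argmax(scores))]
```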
|
#6 | Detection of task-incomplete dialogs based on utterance-and-behavior tag N-gram for spoken dialog systems
Sunao Hara (Graduate School of Information Science, Nagoya University, Japan), Norihide Kitaoka (Graduate School of Information Science, Nagoya University, Japan), Kazuya Takeda (Graduate School of Information Science, Nagoya University, Japan)
We propose a method of detecting "task-incomplete" dialogs in spoken dialog systems using N-gram-based dialog models. We used a database created during a field test in which inexperienced users used a client-server music retrieval system with a spoken dialog interface on their own PCs. In this study, the dialog for a music retrieval task consisted of a sequence of user and system tags that related their utterances and behaviors. The dialogs were manually classified into two classes: the dialog either completed the music retrieval task or it did not. We then detected dialogs that did not complete the task, using N-gram probability models or a Support Vector Machine with N-gram feature vectors trained on the manually classified dialogs. Off-line and on-line detection experiments were conducted on a large amount of real data, and the results show that our proposed method achieved good classification performance.
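A minimal sketch of an SVM variant of such a detector, assuming each dialog has already been converted into a space-separated string of utterance-and-behavior tags; the tag names and the use of scikit-learn are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Each dialog is a sequence of user/system tags; labels mark task completion.
dialogs = [
    "U:search S:present U:select S:play",               # completed
    "U:search S:present U:search S:present U:hangup",   # not completed
]
labels = [1, 0]   # 1 = task completed, 0 = task incomplete

# N-gram features (here unigrams to trigrams) over the tag sequences.
vec = CountVectorizer(ngram_range=(1, 3), token_pattern=r"\S+")
X = vec.fit_transform(dialogs)

clf = LinearSVC().fit(X, labels)
new_dialog = "U:search S:present U:hangup"
is_complete = clf.predict(vec.transform([new_dialog]))[0]
```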
|
#7 | Shrinkage Based Features for Natural Language Call-Routing
Ruhi Sarikaya (IBM T.J. Watson Research Center), Stanley F. Chen (IBM T.J. Watson Research Center), Bhuvana Ramabhadran (IBM T.J. Watson Research Center)
The feature set used with a classifier can have a large impact on classification performance. This paper presents a set of shrinkage-based features for Maximum Entropy and other classifiers in the exponential family. These features are inspired by the exponential class-based language model, Model M. We motivate the use of these features for the task of text classification and evaluate them on a natural language call routing task. The proposed features along with a new word clustering method result in significant improvements in action classification accuracy over typical word-based features, particularly for small amounts of training data.
|
#8 | Clustering with modified cosine distance learned from constraints
Leonid Rachevsky (IBM), Dimitri Kanevsky (IBM), Ruhi Sarikaya (IBM), Bhuvana Ramabhadran (IBM)
In this paper we present a modified cosine similarity metric that helps to make features more discriminative. The new metric is defined via various linear transformations of the original feature space to a space in which the samples are better separated. These transformations are learned from a set of constraints representing available domain knowledge by solving related optimization problems. We present results on two natural language call routing datasets that show significant improvements, ranging from 3% to 5% absolute, in the purity of clusters obtained in an unsupervised fashion.
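A small sketch of the kind of metric involved: cosine similarity computed after a linear map A of the feature space. Here A is a placeholder identity matrix; learning it from the domain-knowledge constraints via the optimization problems mentioned above is not shown.

```python
import numpy as np

def transformed_cosine(x, y, A):
    """Cosine similarity of x and y after the linear map A (learned elsewhere)."""
    ax, ay = A @ x, A @ y
    return float(ax @ ay / (np.linalg.norm(ax) * np.linalg.norm(ay) + 1e-12))

# Placeholder: in practice A would be learned from pairwise constraints.
d = 4
A = np.eye(d)
x, y = np.random.rand(d), np.random.rand(d)
sim = transformed_cosine(x, y, A)
```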
|
#9 | Using Speaker ID to Discover Repeat Callers to a Spoken Dialog System
Andrew Fandrianto (Carnegie Mellon University), Brian Langner (Carnegie Mellon University), Alan W Black (Carnegie Mellon University)
This paper describes using speaker ID techniques to identify repeat callers in a spoken dialog system, using only acoustic features. It is often useful to know whether a dialog user is a novice or experienced, and identifying data such as Caller ID can be unreliable or unavailable. Our approach attempts to remedy this by determining user identity within a dialog session from the acoustic information in the dialog. We optimize the audio content of each call by removing artifacts not relevant to modeling speech. This technique is applied to finding consecutive callers and to creating unique user identities across all calls over a larger time frame, with the aim of tuning or adapting the dialog system based on the user identity. Our results show that the technique is effective in recognizing consecutive callers and in identifying unique user identities in a large set of calls.
|
#10 | Semantic graph clustering for POMDP-based spoken dialog systems
Florian Pinault (LIA University of Avignon), Fabrice Lefèvre (LIA University of Avignon)
Dialog managers (DM) in spoken dialogue systems make decisions in highly uncertain conditions, due to errors from the speech recognition and spoken language understanding (SLU) modules. In this work, a framework to interface efficient probabilistic modeling for both the SLU and the DM modules is described and investigated. A thorough representation of the user semantics is inferred by the SLU in the form of a graph of frames and, complemented with some contextual information, is mapped to a summary space in which a stochastic POMDP dialogue manager can plan actions while taking into account the uncertainty on the current dialogue state. Tractability is ensured by the use of an intermediate summary space. Also, to reduce the development cost of SDS, an approach based on clustering is proposed to automatically derive the master-summary mapping function. A preliminary implementation is presented in the MEDIA domain (tourist information and hotel booking) and tested with a simulated user.
|
#11 | Learning Place-Names from Spoken Utterances and Localization Results by Mobile Robot
Ryo Taguchi (Nagoya Institute of Technology), Yuji Yamada (Nagoya Institute of Technology), Koosuke Hattori (Nagoya Institute of Technology), Taizo Umezaki (Nagoya Institute of Technology), Masahiro Hoguro (Chubu University), Naoto Iwahashi (National Institute of Information and Communications Technology), Kotaro Funakoshi (Honda Research Institute Japan Co., Ltd.), Mikio Nakano (Honda Research Institute Japan Co., Ltd.)
This paper proposes a method for the unsupervised learning of place-names from pairs of a spoken utterance and a localization result, which represents the current location of a mobile robot, without any a priori linguistic knowledge other than a phoneme acoustic model. In previous work, we proposed a lexical learning method based on statistical model selection. This method can learn words that represent a single object, such as proper nouns, but cannot learn words that represent classes of objects, such as general nouns. This paper describes improvements to the method for learning both the phoneme sequence of each word and the distribution of objects that the word represents.
|
#12 | Active Learning for Dialogue Act Classification
Björn Gambäck (SICS, Swedish Institute of Computer Science AB / Norwegian University of Science and Technology), Fredrik Olsson (SICS, Swedish Institute of Computer Science AB), Oscar Täckström (SICS, Swedish Institute of Computer Science AB)
Active learning techniques were employed for classification of dialogue acts over two dialogue corpora, the English human-human Switchboard corpus and the Spanish human-machine Dihana corpus. It is shown clearly that active learning improves on a baseline obtained through a passive learning approach to tagging the same data sets. An error reduction of 7% was obtained on Switchboard, while a factor of 5 reduction in the amount of labeled data needed for classification was achieved on Dihana. The passive Support Vector Machine learner used as the baseline in itself significantly improves the state of the art in dialogue act classification on both corpora. On Switchboard it gives a 31% error reduction compared to the previously best reported result.
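For concreteness, the sketch below shows one common active-learning loop (uncertainty sampling with an SVM, repeatedly querying the pool examples the classifier is least sure about); the actual query strategy and features used in the paper may differ.

```python
import numpy as np
from sklearn.svm import LinearSVC

def active_learning_loop(X_pool, y_pool, seed_idx, rounds=10, batch=20):
    """Uncertainty sampling: repeatedly label the pool examples the SVM is
    least sure about. y_pool stands in for the human annotator."""
    labelled = set(seed_idx)
    for _ in range(rounds):
        idx = sorted(labelled)
        clf = LinearSVC().fit(X_pool[idx], y_pool[idx])
        unlabelled = np.array([i for i in range(len(y_pool)) if i not in labelled])
        if len(unlabelled) == 0:
            break
        scores = clf.decision_function(X_pool[unlabelled])
        if scores.ndim == 1:                    # binary: distance to hyperplane
            confidence = np.abs(scores)
        else:                                   # multi-class: top-two margin
            top2 = np.sort(scores, axis=1)[:, -2:]
            confidence = top2[:, 1] - top2[:, 0]
        query = unlabelled[np.argsort(confidence)[:batch]]
        labelled.update(query.tolist())         # "ask the annotator" for these
    idx = sorted(labelled)
    return LinearSVC().fit(X_pool[idx], y_pool[idx])
```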
|
#13 | Speaker Role Recognition using question detection and characterization
Thierry Bazillon (Aix Marseille Universite, LIF-CNRS), Benjamin Maza (Universite d'Avignon, LIA-CERI), Mickael Rouvier (Universite d'Avignon, LIA-CERI), Frederic Bechet (Aix Marseille Universite, LIF-CNRS), Alexis Nasr (Aix Marseille Universite, LIF-CNRS)
Speech Data Mining is an area of research dedicated to characterizing audio streams containing speech of one or more speakers, using descriptors related to the form and the content of the speech signal. Besides the automatic word transcription process, information about the type of audio stream and the role and identity of the speakers is also crucial to allow complex queries. In this framework, we present a study done on broadcast conversations examining how speakers express questions, starting with the initial intuition that the surface form of the questions uttered is a signature of the role of the speakers in the conversation (anchor, guest, expert, etc.). By classifying these questions with a set of labels and using this information, in addition to the commonly used descriptors, to classify speakers' roles in broadcast conversations, we aim to improve the role classification accuracy and validate our initial intuition.
|
#14 | Learning Score Structure from Spoken Language for A Tennis Game
Qiang Huang (University of East Anglia), Stephen Cox (University of East Anglia)
We describe a novel approach to inferring the scoring rules of a tennis game by analysing the chair umpire's speech. In a tennis match, the chair umpire, amongst other tasks, announces the scores; hence his or her speech is the key resource for inferring the scoring rules of tennis. In this work, the learning procedure consists of two steps: speech recognition followed by rule inference. For speech recognition, we use two coupled language models, one for words and one for scores. The first makes use of the internal structure of a score; the second, of the dependency of a score on the previous score. For rule inference, we utilize a multigram model to segment the recognised score streams into variable-length score sequences, each of them corresponding to a game in a tennis match. The approach is applied to four complete tennis matches, and shows both enhanced recognition performance and a promising approach to inferring the scoring rules of the game.
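The multigram segmentation step can be pictured as a Viterbi-style dynamic program that cuts the recognised score stream into variable-length segments of maximal probability. The sketch below assumes a segment log-probability function (normally supplied by trained multigram parameters) and an upper bound on segment length; it is an illustration, not the authors' implementation.

```python
import math

def segment(tokens, seg_logprob, max_len=6):
    """Best segmentation of a token sequence into variable-length segments.

    seg_logprob(segment) is assumed to return the log-probability that the
    given tuple of score tokens forms one game.
    """
    n = len(tokens)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start][0] + seg_logprob(tuple(tokens[start:end]))
            if score > best[end][0]:
                best[end] = (score, start)
    # Backtrack to recover the segment boundaries.
    segments, pos = [], n
    while pos > 0:
        start = best[pos][1]
        segments.append(tokens[start:pos])
        pos = start
    return list(reversed(segments))
```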
|
#15 | Semi-automated classifier adaptation for natural language call routing
Silke M. Witt (West)
Commercial spoken dialogue systems are traditionally static in the sense that once deployed, these applications are only updated periodically. Also, the creation of classifiers in call routing applications requires expensive manual annotation of caller intents. This work introduces a process to semi-automatically annotate new data and to use the new annotations to update the training corpus, iteratively improving classification performance. The new method combines a multiple-classifier voting scheme and an iterative boosting mechanism to continually update the classifier with the newly, automatically annotated data. The method was tested with 6 weeks' worth of data from a live system. It is shown that with this approach about 93% of all new utterances can be automatically annotated. Using the iterative boosting approach increased the size of the training corpus by about 6% per iteration while at the same time slightly increasing the classification accuracy.
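A minimal sketch of the voting step, assuming a pool of already-trained classifiers exposing a predict() interface: an utterance is auto-annotated only when enough classifiers agree, and the agreed items can then be added to the training corpus before retraining (the iterative boosting step itself is not shown).

```python
def semi_auto_annotate(unlabelled, classifiers, min_agree=3):
    """Label an utterance automatically only when enough classifiers agree.

    classifiers is a list of already-trained models with a .predict() method;
    utterances where fewer than min_agree classifiers agree are left for
    manual review.
    """
    auto, manual = [], []
    for utt in unlabelled:
        votes = [clf.predict([utt])[0] for clf in classifiers]
        top = max(set(votes), key=votes.count)
        if votes.count(top) >= min_agree:
            auto.append((utt, top))      # add to the training corpus
        else:
            manual.append(utt)           # route to a human annotator
    return auto, manual
```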
|
#16 | Interactional Style Detection for Versatile Dialogue Response Using Prosodic and Semantic Features
Wei-Bin Liang (Dept. of CSIE, NCKU, Tainan, Taiwan), Chung-Hsien Wu (Dept. of CSIE, NCKU, Tainan, Taiwan), Chih-Hung Wang (Dept. of CSIE, NCKU, Tainan, Taiwan), Jhing-Fa Wang (Dept. of Electrical Engineering, NCKU, Tainan, Taiwan)
This work presents an approach to interactional style (IS) detection for versatile responses in spoken dialogue systems (SDSs). Since speakers generally express their intents in different styles, the responses of an SDS should be versatile rather than invariable. Moreover, the IS of dialogue turns can be affected by dialogue topics and speakers' emotional states. In this work, three base-level classifiers are employed for preliminary detection: latent Dirichlet allocation for dialogue topic categorization, a support vector machine for prosody-based emotional state identification, and maximum entropy for semantic label-based emotional state identification. Finally, an artificial neural network is adopted for IS detection, considering the scores estimated by the aforementioned classifiers. To evaluate the proposed approach, an SDS in a chatting domain was constructed. The evaluation results reveal that IS detection can achieve 82.67% accuracy.
|
#17 | Quality aspects of multimodal dialog systems: identity, stimulation and success
Christine Kuehnel (Quality & Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany), Benjamin Weiss (Quality & Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany), Matthias Schulz (Quality & Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany), Sebastian Moeller (Quality & Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany)
So far, not much is known about the relationships between the quality aspects of multimodal dialog systems. This paper aims at closing this gap by analyzing the influence of input and output modalities on a system's usability. The underlying study was carried out with a smart-home system offering speech, gesture and touch, as well as the combination of these three, for input, and a text-to-speech system, a TV screen and a smartphone screen for output.
The results indicate that the usability of a multimodal system is composed of hedonic and pragmatic aspects. The hedonic aspects are influenced by the identity transported by the output channels and the stimulation of the input modalities. A measure for task success was sufficient to describe the pragmatic aspect.
|