12th Annual Conference of the
International Speech Communication Association
Interspeech 2011, Florence
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Wed-Ses1-S3: Speech Processing Tools
Time: Wednesday 10:00
Place: Donatello (Room Onice) - Pala Congressi - Ground Floor
Type: Poster
Chair: Christoph Draxler
#1 | Speech Processing Tools - An Introduction to Interoperability
Christoph Draxler (Institute of Phonetics and Speech Processing, LMU Munich), Toomas Altosaar (Aalto University School of Science and Technology, Espoo, Finland), Sadaoki Furui (Dept. of Computer Science, Tokyo Institute of Technology, Japan), Mark Liberman (Dept. of Linguistics, University of Pennsylvania, Philadelphia PA, USA), Peter Wittenburg (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)
Research and development in the field of spoken language depends critically on the existence of software tools. A large range of excellent tools has been developed and is widely used today. Most tools were developed by individuals who recognized the need for a given tool, had the necessary conceptual and programming skills, and were deeply rooted in the application field, namely spoken language.
Excellent tools are a prerequisite for research. However, tool developers rarely receive academic recognition for their efforts. Journals, conferences and funding agencies are interested in the results of work on a research question, while the tools developed to achieve these results are of less interest.
The Interspeech 2011 special event on speech processing tools aims to provide a forum for tool developers to improve their academic visibility and thus enhance their motivation to continue developing the software needed by the community.
|
#2 | EasyAlign: an automatic phonetic alignment tool under Praat
Jean-Philippe Goldman (University of Geneva)
We present EasyAlign, a user-friendly automatic phonetic alignment tool for continuous speech. It is developed as a plug-in for Praat, the popular speech analysis software, and is freely available. Its main advantage is that speech can easily be aligned from an orthographic transcription. It requires a few minor manual steps, and the result is a multi-level annotation within a TextGrid composed of phonetic, syllabic, lexical and utterance tiers. Evaluation showed that the performance of this HTK-based aligner is comparable to human alignment and to other existing alignment tools. It was originally fully available for French and English. Community interest in extending it to other languages led to a straightforward methodology for adding languages. Spanish and Taiwan Min were recently added, and other languages are under development.
|
#3 | MTRANS: A multi-channel, multi-tier speech annotation tool
Julián Villegas (Ikerbasque (Basque Science Foundation), Spain), Martin Cooke (Ikerbasque (Basque Science Foundation), Spain and Language and Speech Laboratory, Universidad del Pais Vasco, Spain), Vincent Aubanel (Ikerbasque (Basque Science Foundation), Spain), Marco A. Piccolino-Boniforti (Dept. Linguistics, Univ. Cambridge, UK)
MTRANS, a freely available tool for annotating multi-channel speech, is presented. The software is designed to provide the visual and aural display flexibility required for transcribing multi-party conversations. In particular, it eases the analysis of speech overlaps by overlaying waveforms and spectrograms (with controllable transparency), and it simplifies the mapping from media channels to annotation tiers by allowing arbitrary associations between them. MTRANS supports interoperability with other tools via the Open Sound Control protocol, as sketched below.
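As a rough illustration of the Open Sound Control (OSC) route to interoperability, the Java sketch below hand-encodes a minimal OSC 1.0 message and sends it over UDP. The address pattern, port and string arguments are hypothetical placeholders, since the abstract does not specify MTRANS's actual OSC namespace; consult the tool's documentation for the real message layout.

import java.io.ByteArrayOutputStream;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Minimal OSC 1.0 sender; address pattern, arguments and port are hypothetical.
public class OscSketch {

    // Encode an OSC string: ASCII bytes, null-terminated, padded to a 4-byte boundary.
    static void writeOscString(ByteArrayOutputStream out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        out.write(bytes, 0, bytes.length);
        int pad = 4 - (bytes.length % 4);   // at least one null byte is always written
        for (int i = 0; i < pad; i++) {
            out.write(0);
        }
    }

    // Build an OSC message carrying two string arguments (e.g. a tier name and a label).
    static byte[] buildMessage(String address, String arg1, String arg2) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeOscString(out, address);   // address pattern
        writeOscString(out, ",ss");     // type tag string: two string arguments
        writeOscString(out, arg1);
        writeOscString(out, arg2);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical address pattern and port; not MTRANS's documented namespace.
        byte[] msg = buildMessage("/annotation/add", "tier1", "overlap");
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(msg, msg.length,
                    InetAddress.getLoopbackAddress(), 9000));
        }
    }
}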
|
#4 | The JSafran platform for semi-automatic speech processing
Christophe Cerisara (LORIA-CNRS UMR 7503), Claire Gardent (LORIA-CNRS UMR 7503)
JSafran is an open-source Java platform for editing, annotating and transforming speech corpora, both manually and automatically, at many levels: transcription, alignment, morphosyntactic tagging, syntactic parsing and semantic role labelling. It integrates preconfigured state-of-the-art libraries for this purpose, including the Sphinx4, TreeTagger, OpenNLP, MaltParser and MATE applications, as well as the companion JTrans software for text-to-speech alignment and transcription. Despite the complexity of such speech processing tasks, JSafran has been designed to maximize simplicity both for the end-user, thanks to an easy-to-use GUI that controls all of these automatic and manual annotation functionalities, and for the developer, thanks to well-defined interfaces and to the multi-level stand-off annotation paradigm (illustrated below). JSafran has so far been used for several tasks, including the creation of a new French treebank on top of the broadcast news ESTER corpus.
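To make the stand-off idea concrete, here is a minimal Java sketch under the assumption that each annotation layer stores offsets into an immutable base transcription rather than embedding markup in it. The class and field names are illustrative only, not JSafran's documented data model.

import java.util.ArrayList;
import java.util.List;

// Illustrative stand-off annotation: each layer points into an immutable base
// transcription by offsets instead of embedding markup in the text itself.
// Class and field names are assumptions, not JSafran's actual data model.
public class StandoffSketch {

    record Annotation(String layer, int begin, int end, String label) {}

    public static void main(String[] args) {
        String transcript = "the cat sat";

        List<Annotation> annotations = new ArrayList<>();
        // Token layer: character offsets into the base transcription.
        annotations.add(new Annotation("token", 0, 3, "the"));
        annotations.add(new Annotation("token", 4, 7, "cat"));
        annotations.add(new Annotation("token", 8, 11, "sat"));
        // A higher-level layer (part of speech) referring to the same base text.
        annotations.add(new Annotation("pos", 4, 7, "NOUN"));

        for (Annotation a : annotations) {
            System.out.printf("%s [%d,%d) %s -> \"%s\"%n",
                    a.layer(), a.begin(), a.end(), a.label(),
                    transcript.substring(a.begin(), a.end()));
        }
    }
}

Because every layer refers to the same base text, new layers can be added or regenerated automatically without touching the transcription or the other layers.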
|
#5 | The Social Signal Interpretation Framework (SSI) for Real Time Signal Processing and Recognition
Johannes Wagner (Lab for Human Centered Multimedia, Augsburg University), Florian Lingenfelser (Lab for Human Centered Multimedia, Augsburg University), Elisabeth Andre (Lab for Human Centered Multimedia, Augsburg University)
The construction of systems for recording, processing and recognising a human's social and affective signals is a challenging effort that involves numerous necessary sub-tasks. In this article, we introduce our Social Signal Interpretation (SSI) tool, a framework dedicated to supporting the development of such systems. It provides a flexible architecture for constructing pipelines that handle multiple modalities, such as audio or video, and for setting up online and offline recognition tasks. The plug-in system of SSI encourages developers to integrate external code, while an XML interface allows anyone to write their own applications with a simple text editor. Furthermore, data recording, annotation and classification can be done using a straightforward graphical user interface, giving inexperienced users simple access.
|
#6 | ELAN – aspects of interoperability and functionality
Han Sloetjes (Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands), Peter Wittenburg (Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands), Aarthy Somasundaram (Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands)
ELAN is a multimedia annotation tool that has been developed for roughly ten years now and is still being extended and improved with, on average, two or three major updates per year. This paper describes the current state of the application, the main areas of attention of the past few years and the plans for the near future. The emphasis is on various interoperability issues: interoperability with other tools through file conversions, process-based interoperability with other tools by means of commands sent to or received from other applications, interoperability on the level of the data model, and semantic interoperability.
|
#7 | Open source voice creation toolkit for the MARY TTS Platform
Marc Schröder (DFKI GmbH), Marcela Charfuelan (DFKI GmbH), Sathish Pammi (DFKI GmbH), Ingmar Steiner (INRIA/LORIA Speech Group)
This paper describes an open source voice creation toolkit that supports the creation of unit selection and HMM-based voices for the MARY (Modular Architecture for Research on speech Synthesis) TTS platform. The toolkit can be easily employed to create voices in the languages already supported by MARY TTS, but it also provides the tools and generic, reusable run-time system modules to add new languages. The voice creation toolkit is mainly intended for research groups working on speech technology throughout the world, notably those who do not yet have their own technology. We aim to provide them with a reusable technology that lowers the entrance barrier and makes it easier to get started. The toolkit is developed in Java and includes an intuitive Graphical User Interface (GUI) for most of the common tasks in the creation of a synthetic voice. We present the toolkit and discuss a number of interoperability issues.
|
#8 | Java Visual Speech Components for Rapid Application Development of GUI based Speech Processing Applications
Stefan Steidl (International Computer Science Institute (ICSI)), Korbinian Riedhammer (Computer Science Department, University of Erlangen-Nuremberg, Germany), Tobias Bocklet (Computer Science Department, University of Erlangen-Nuremberg, Germany), Florian Hönig (Computer Science Department, University of Erlangen-Nuremberg, Germany), Elmar Nöth (Computer Science Department, University of Erlangen-Nuremberg, Germany)
In this paper, we describe a new Java framework for developing GUI-based speech processing applications in an easy and efficient way. Standard components are provided to display the speech signal, the power plot, and the spectrogram. Furthermore, a component to create a new transcription and to display and manipulate an existing transcription is provided, as well as a component to display and manually correct external pitch values. These Swing components can easily be embedded into one's own Java programs, and they can be synchronized to display the same region of the speech file. The object-oriented design provides base classes for the rapid development of custom components, as sketched below.
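The following sketch shows the kind of component described here: a Swing panel that paints a waveform and can be dropped into any frame. It is an illustrative stand-in written from scratch, not one of the authors' actual classes, whose names are not given in the abstract.

import java.awt.Color;
import java.awt.Graphics;
import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.swing.SwingUtilities;

// Illustrative waveform display panel; not the authors' actual Swing component.
public class WaveformPanelSketch extends JPanel {

    private final double[] samples;   // normalised samples in [-1, 1]

    public WaveformPanelSketch(double[] samples) {
        this.samples = samples;
    }

    @Override
    protected void paintComponent(Graphics g) {
        super.paintComponent(g);
        g.setColor(Color.BLUE);
        int w = getWidth(), h = getHeight();
        // Draw one line segment per pixel column, downsampling the signal.
        for (int x = 1; x < w; x++) {
            int i0 = (x - 1) * samples.length / w;
            int i1 = x * samples.length / w;
            int y0 = (int) ((1 - samples[i0]) * h / 2);
            int y1 = (int) ((1 - samples[i1]) * h / 2);
            g.drawLine(x - 1, y0, x, y1);
        }
    }

    public static void main(String[] args) {
        // A 440 Hz sine at 8 kHz stands in for a real speech signal.
        double[] sine = new double[8000];
        for (int i = 0; i < sine.length; i++) {
            sine[i] = Math.sin(2 * Math.PI * 440 * i / 8000.0);
        }
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("Waveform sketch");
            frame.add(new WaveformPanelSketch(sine));
            frame.setSize(600, 200);
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setVisible(true);
        });
    }
}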
|
#9 | mTalk - A Multimodal Browser for Mobile Services
Michael Johnston (AT&T Labs - Research, Inc.), Giuseppe Di Fabbrizio (AT&T Labs - Research, Inc.), Simon Urbanek (AT&T Labs - Research, Inc.)
The mTalk multimodal browser is a tool which enables rapid prototyping for research and development of mobile multimodal interfaces combining natural modalities such as speech, touch, and gesture. mTalk integrates a broad range of open standards for authoring graphical and spoken user interfaces and is supported by a cloud-based multimodal processing architecture. In this paper, we describe mTalk and illustrate its capabilities through examination of a series of sample applications.
|
#10 | Web-based automatic speech recognition service - webASR
Stuart Nicholas Wrigley (University of Sheffield), Thomas Hain (University of Sheffield)
A state-of-the-art automatic speech recognition (ASR) system was developed as part of the AMIDA project, whose core domain was the transcription of small to medium sized meetings. The system has performed well in recent NIST evaluations (RT'07 and RT'09). This research-grade ASR system has now been made available as a free web service (webASR) targeting non-commercial researchers. Access to the service is via a standard browser-based interface as well as an API. The service provides the facility to upload audio recordings, which are then processed by the ASR system to produce a word-level transcript. Such transcripts are available in a range of formats to suit different needs and levels of technical expertise. The API allows the core webASR functionality to be integrated seamlessly into applications and services. Detailed descriptions of the system design and user interface are provided.
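As a hedged illustration of how such an API might be called, the Java sketch below POSTs a WAV file to a transcription endpoint and prints the response. The URL, content type and response handling are assumptions made for the example; the real webASR API (endpoints, authentication, job polling, transcript formats) should be taken from its own documentation.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch of a client for an HTTP transcription service. The endpoint URL and
// the request/response handling are illustrative assumptions only.
public class AsrClientSketch {
    public static void main(String[] args) throws Exception {
        byte[] audio = Files.readAllBytes(Paths.get("meeting.wav"));

        // Hypothetical endpoint; replace with the documented webASR API URL.
        URL url = new URL("https://example.org/webasr/api/submit");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "audio/wav");

        try (OutputStream out = conn.getOutputStream()) {
            out.write(audio);   // upload the recording
        }

        // Print whatever the service returns (e.g. a job id or a transcript).
        try (InputStream in = conn.getInputStream()) {
            System.out.println(new String(in.readAllBytes()));
        }
        conn.disconnect();
    }
}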
|
#11 | A Web based Speech Transcription Workplace
Markus Klehr (European Media Laboratory GmbH, Heidelberg, Germany), Andreas Ratzka (European Media Laboratory GmbH, Heidelberg, Germany), Thomas Ross (European Media Laboratory GmbH, Heidelberg, Germany)
We describe our web-based speech transcription tool, the EML Transcription Workplace (TWP). Apart from its main purpose of annotating audio data, it also includes support for the management of transcription data, ASR-based pre-transcription, assignment of work packages to specific users, user management, and a correction/verification workflow. These features help to increase productivity for both transcriptionists and supervisors and facilitate further processing.
|
#12 | WinPitch, a multimodal tool for speech analysis of endangered languages
Philippe Martin (UFRL, Université Paris Diderot)
WinPitch is a speech analysis program running on PC and Mac for the acoustical analysis of speech corpora. It includes a large number of specialized functions to transcribe, align and analyze large sound and video recordings. It supports multiple hierarchical layers for segmentation (up to 96 layers), speaker lists, and overlapping speech. Various character encodings, including Unicode, are supported, with optional right-to-left text display for Arabic and Hebrew transcriptions. Interfaces with other popular speech analysis programs are provided, as well as standard alignment input and output in XML format. Many functions are devoted to the transcription, alignment and description of less documented languages, such as slow-speed playback, a programmable keyboard, automatic lexicon generation and text labeling. Various software functions are described together with their applications to the analysis of Parkatêjê, a Timbira language spoken in Amazonia by about 400 speakers.
|
#13 | Recording caregiver interactions for machine acquisition of spoken language using the KLAIR virtual infant
Mark Huckvale (University College London)
The goal of the KLAIR project is to facilitate research into the computational modelling of spoken language acquisition. Previously we have described the KLAIR toolkit, which implements a virtual infant that can see, hear and talk. In this paper we describe how the toolkit has been enhanced and extended to make it easier to build interactive applications that promote dialogues with human subjects, and also to record and document them. The primary developments are the introduction of 3D models, the integration of speech recognition, real-time video recording, support for .NET languages, and additional tools for supporting interactive experiments. An example experimental configuration is described in which KLAIR appears to learn how to say the names of toys in order to encourage dialogue with caregivers.
|