Tutorial M2

Building an Open Vocabulary ASR System using Open Source Software

Stefan Hahn (RWTH Aachen University, Computer Science Department)
David Rybach (RWTH Aachen University, Computer Science Department)
Abstract: Most speech recognition systems handle closed vocabularies only. For applications processing nearly unconstrained speech input, even huge vocabularies cannot cover all words. The open vocabulary approach presented here allows for the recognition of unknown words by incorporating sub-word units, for example word fragments, into the recognition process. These sub-word units may be merged together to form new words.

There is a growing number of publicly available open source software packages related to language and speech processing. Within this tutorial, all the necessary steps to build an automatic speech recognition (ASR) system with an open vocabulary for English will be presented, relying solely on open source software and publicly available data. Besides a review of the required theoretical background, the focus is on the practical aspects of building an open vocabulary speech recognition system from scratch, and on the potential pitfalls. The tutorial will use the RWTH Aachen University Open Source Speech Recognition Toolkit (RWTH ASR) for the development of acoustic models, as well as the included large vocabulary continuous speech recognition decoder. All tasks required to set up a baseline speech recognizer will be covered: starting from the configuration of the signal analysis, the estimation of Gaussian mixture models, phonetic decision trees, and speaker adaptation techniques are presented. Here, the focus is not so much on the theoretical aspects of these methods as on how to use them within RWTH ASR. The training of language models will be demonstrated using the SRI LM toolkit, which is used throughout the speech community. The Sequitur g2p toolkit will be applied to generate pronunciations for words not included in the pronunciation dictionary.

The English example task used throughout the tutorial is based only on publicly available data, in particular audio data and accompanying transcriptions published under an open source licence by Voxforge, and public domain books from Project Gutenberg. The size of the example task is deliberately kept small, such that it can be processed on a standard desktop computer in reasonable time. In addition to the trained statistical models and the data used, a collection of ready-to-use scripts and configuration files will be provided to the attendees of the tutorial, making it easy to port the system to different data and tasks. The main target audience is anyone interested in the practical aspects of building and understanding ASR systems. Knowledge of the theoretical background of basic ASR systems is assumed.
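To give a flavour of the fragment-merging idea, here is a minimal Python sketch. It is not part of RWTH ASR; the trailing '+' continuation marker is a hypothetical convention for how a decoder might tag sub-word units in its output.

```python
def merge_fragments(tokens):
    """Merge sub-word fragments (tagged with a trailing '+')
    into full word hypotheses.

    Hypothetical convention: 'dis+' 'similar+' 'ity' -> 'dissimilarity'.
    """
    words, buffer = [], ""
    for token in tokens:
        if token.endswith("+"):           # fragment continues into the next unit
            buffer += token[:-1]
        else:
            words.append(buffer + token)  # fragment sequence (if any) ends here
            buffer = ""
    if buffer:                            # dangling fragment at utterance end
        words.append(buffer)
    return words

print(merge_fragments(["the", "dis+", "similar+", "ity", "measure"]))
# ['the', 'dissimilarity', 'measure']
```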
Short Bios: Stefan Hahn studied computer science at RWTH Aachen University. He joined the Human Language Technology and Pattern Recognition Group headed by Prof. Dr.-Ing. Hermann Ney in 2004. In 2006 he received the Diploma degree in computer science at RWTH Aachen University. He is currently working at the Computer Science Department of RWTH Aachen University as a Ph.D. research assistant. His research interests include automatic speech recognition, log-linear modeling, spoken language understanding, and monotone string-to-string translation.

David Rybach received his Diploma degree in computer science in 2006 from RWTH Aachen University, Germany. In 2006 he joined the Human Language Technology and Pattern Recognition Group headed by Prof. Dr.-Ing. Hermann Ney as a Ph.D. research assistant. His main research interests lie in the area of automatic speech recognition. He maintains the RWTH Aachen University Open Source Speech Recognition Toolkit.
Learning with Rich Prior Knowledge

Joao Graca (University of Pennsylvania)
Gregory Druck (University of Massachusetts Amherst, Computer Science)
Kuzman Ganchev (Google Inc.)
Abstract: We possess a wealth of prior knowledge about most prediction problems, and particularly so for many of the fundamental tasks in speech processing and generation. For example, when learning letter-to-sound rules, even unlabeled words should obey some phonotactic constraints (such as 'each word should contain a vowel'). Unfortunately, it is often difficult to make use of this type of information during learning, as it typically does not come in the form of labeled examples, may be difficult to encode as a prior on parameters in a Bayesian setting, and may be impossible to incorporate into a tractable model. Instead, we usually have prior knowledge about the values of output variables. For example, we might know that several different speech recognizers should learn to agree on untranscribed data. Motivated by the prospect of being able to naturally leverage such knowledge, four different groups have recently developed similar, general frameworks for expressing and learning with side information about output variables. These frameworks are Constraint-Driven Learning (UIUC), Posterior Regularization (UPenn), Generalized Expectation Criteria (UMass Amherst), and Learning from Measurements (UC Berkeley). The tutorial will provide the audience with the theoretical background to understand why these methods have been so effective, as well as practical guidance on how to apply them. Specifically, we discuss issues that come up in implementation, and describe a toolkit that provides 'out-of-the-box' support for the applications described in the tutorial, and is extensible to other applications and new types of prior knowledge.
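As an illustration of learning with an expectation constraint, the sketch below adds a GE-style penalty to binary logistic regression in plain numpy: the mean predicted positive rate on unlabeled data is pushed towards a prior target. This is a toy simplification (a squared-distance penalty rather than the KL-based criteria these frameworks actually use), not code from the toolkit mentioned above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_with_expectation_penalty(Xl, yl, Xu, target=0.3, lam=10.0,
                                 lr=0.1, steps=500):
    """Binary logistic regression with a GE-style penalty that pushes
    the mean predicted positive rate on unlabeled data Xu towards a
    prior belief `target` (toy squared-distance version)."""
    w = np.zeros(Xl.shape[1])
    for _ in range(steps):
        pl = sigmoid(Xl @ w)                  # labeled predictions
        pu = sigmoid(Xu @ w)                  # unlabeled predictions
        grad = Xl.T @ (pl - yl) / len(Xl)     # labeled NLL gradient
        # gradient of lam * (mean(pu) - target)^2, using
        # d mean(pu)/dw = mean over Xu of pu*(1-pu)*x
        grad += 2 * lam * (pu.mean() - target) * \
                (Xu.T @ (pu * (1 - pu))) / len(Xu)
        w -= lr * grad
    return w

# toy usage with random data
rng = np.random.default_rng(0)
Xl = rng.normal(size=(20, 3)); yl = (Xl[:, 0] > 0).astype(float)
Xu = rng.normal(size=(200, 3))
w = fit_with_expectation_penalty(Xl, yl, Xu, target=0.5)
```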
Short Bios: Joao Graca (joao.graca@l2f.inesc-id.pt), Gregory Druck (gdruck@cs.umass.edu), Kuzman Ganchev
Blind Speech Separation based on Independent Component Analysis and Sparse Component Analysis

Shoji Makino (University of Tsukuba)
Hiroshi Sawada (NTT Communication Science Laboratories)
Abstract: This tutorial describes a state-of-the-art method for the blind source separation (BSS) of convolutive mixtures of audio signals. Independent component analysis (ICA) is used as a major statistical tool for separating the mixtures. We provide examples to show how ICA criteria change as the number of audio sources increases. We then discuss a frequency-domain approach where simple instantaneous ICA is employed in each frequency bin. A directivity pattern analysis of the ICA solutions provides us with a physical interpretation of the ICA-based separation. It tells us the relationship between ICA-based BSS and adaptive beamforming. In order to obtain properly separated signals with the frequency-domain approach, the permutation and scaling ambiguity of the ICA solutions should be aligned appropriately. We describe two complementary methods for aligning the permutations, i.e., collecting separated frequency components originating from the same source. The first method exploits the signal envelope dependence of the same source across frequencies. The second method relies on the spatial diversity of the sources, and is closely related to source localization techniques. Finally, we describe methods for sparse source separation, which can be applied even to an underdetermined case. The tutorial will end with a live demonstration of BSS in a real room situation.
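The envelope-based permutation alignment can be sketched compactly. The following Python fragment is a minimal two-source illustration of the first method (signal envelope dependence across frequencies), assuming the per-bin ICA outputs are already available; it is not the tutorial's actual implementation.

```python
import numpy as np

def align_permutations(Y):
    """Align source permutations across frequency bins by envelope
    correlation (minimal two-source sketch).

    Y: complex array of shape (n_bins, 2, n_frames) holding the two
       ICA outputs of every frequency bin.
    Returns Y with each bin permuted so that output k follows the
    same source in every bin.
    """
    env = np.abs(Y)                        # amplitude envelopes
    ref = env[0].copy()                    # running reference envelopes
    for f in range(1, Y.shape[0]):
        # total correlation with the reference for both assignments
        keep = np.corrcoef(env[f, 0], ref[0])[0, 1] + \
               np.corrcoef(env[f, 1], ref[1])[0, 1]
        swap = np.corrcoef(env[f, 0], ref[1])[0, 1] + \
               np.corrcoef(env[f, 1], ref[0])[0, 1]
        if swap > keep:                    # swapped assignment fits better
            Y[f] = Y[f, ::-1]
            env[f] = env[f, ::-1]
        ref = 0.9 * ref + 0.1 * env[f]     # smooth reference update
    return Y
```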
Short Bios: Shoji Makino received B.E., M.E., and Ph.D. degrees from Tohoku University, Japan, in 1979, 1981, and 1993, respectively. He joined NTT in 1981. He is now a Professor at the University of Tsukuba. His research interests include adaptive filtering technologies, the realization of acoustic echo cancellation, blind source separation of convolutive mixtures of speech, and acoustic signal processing for speech and audio applications. He received the ICA Unsupervised Learning Pioneer Award in 2006, the IEEE MLSP Competition Award in 2007, the TELECOM System Technology Award in 2004, the Achievement Award of the Institute of Electronics, Information, and Communication Engineers (IEICE) in 1997, the Outstanding Technological Development Award of the Acoustical Society of Japan (ASJ) in 1995, the Paper Award of the IEICE in 2005 and 2002, and the Paper Award of the ASJ in 2005 and 2002. He is the author or co-author of more than 200 articles in journals and conference proceedings and is responsible for more than 150 patents. He was a Keynote Speaker at ICA2007 and a Tutorial speaker at ICASSP2007. He has served on the IEEE SPS Awards Board (2006-08) and the IEEE SPS Conference Board (2002-04). He is a member of the James L. Flanagan Speech & Audio Processing Award Committee. He was an Associate Editor of the IEEE Transactions on Speech and Audio Processing (2002-05) and is an Associate Editor of the EURASIP Journal on Advances in Signal Processing. He is a member of the SPS Audio and Acoustics Signal Processing Technical Committee and the Chair of the Blind Signal Processing Technical Committee of the IEEE Circuits and Systems Society. He was the Vice President of the Engineering Sciences Society of the IEICE (2007-08) and the Chair of the Engineering Acoustics Technical Committee of the IEICE (2006-08). He is a member of the International IWAENC Standing Committee and a member of the International ICA Steering Committee. He was the General Chair of WASPAA2007, the General Chair of IWAENC2003, the Organizing Chair of ICA2003, and is the designated Plenary Chair of ICASSP2012. Dr. Makino is an IEEE SPS Distinguished Lecturer (2009-10), an IEEE Fellow, an IEICE Fellow, a council member of the ASJ, and a member of EURASIP.

Hiroshi Sawada received the B.E., M.E., and Ph.D. degrees in information science from Kyoto University, Kyoto, Japan, in 1991, 1993, and 2001, respectively. He joined NTT Corporation in 1993. He is now the Group Leader of the Learning and Intelligent Systems Research Group at the NTT Communication Science Laboratories, Kyoto, Japan. His research interests include statistical signal processing, audio source separation, array signal processing, machine learning, latent variable models, graph-based data structures, and computer architecture. From 2006 to 2009, he served as an associate editor of the IEEE Transactions on Audio, Speech and Language Processing. He is a member of the Audio and Acoustic Signal Processing Technical Committee of the IEEE Signal Processing Society. He received the Ninth TELECOM System Technology Award for Student from the Telecommunications Advancement Foundation in 1994, the Best Paper Award of the IEEE Circuits and Systems Society in 2000, and the MLSP Data Analysis Competition Award in 2007. He was a Tutorial speaker at ICASSP2007. He served as a member of the Evaluation Organizing Committee (Audio Committee) of SiSEC 2010 (Signal Separation Evaluation Campaign) featured at LVA/ICA 2010.
Functional Data Analysis for Speech Research

Michele Gubian (Centre for Language and Speech Technology, Nijmegen, the Netherlands)
Abstract: The analysis of the speech signal often requires dealing with data in the form of contours in time, like F0, formants, intensity, etc. Contour analysis usually boils down to the (manual) identification of a set of 'important points', like peaks, valleys, and elbows, whose coordinates serve as numerical shape descriptors. Those descriptors are a suitable input for classic statistical tools like Principal Component Analysis (PCA), linear regression, and ANOVA. This approach incurs several problems. First, one needs to decide in advance which shape traits are relevant and which are not. For example, by deciding to describe contours only in terms of peak/valley points, one implicitly excludes the possibility that aspects like concavity/convexity play a role. Second, the identification of peaks and valleys can be difficult and potentially ambiguous (for example, an intonational *H target may be physically realized as a plateau in an F0 contour). Finally, the identification of the reference points often depends on the judgement of trained listeners, which introduces the risk of bias, issues with respect to inter-annotator agreement, and high costs.

The purpose of this tutorial is to introduce Functional Data Analysis (FDA) as a solution to the aforementioned problems. FDA is a suite of statistical tools introduced in the 90s by J. Ramsay and colleagues. FDA modifies well-known techniques like PCA and linear regression in such a way that they can take whole contours (functions of time) as input variables, as opposed to fixed-length vectors of numbers. In this way, all the information contained in the shape of contours is preserved and used in the analysis. As a result, the intermediate step of selecting the relevant shape traits and (manually) performing the required measurements is eliminated.

In this tutorial, I will show how to carry out FDA on a set of contours, mainly using F0 and formants as examples. I will show how to obtain a functional representation of sampled contours (e.g. F0 from Praat), how to deal with signals of different duration, how to apply functional PCA (FPCA), and how to interpret the graphical and numerical results. I will then turn to more sophisticated approaches, such as the analysis of more than one feature at once (e.g. F0 and intensity, or F0 and speech rate). Finally, I will show the use of FPCA as an exploration tool for speech re-synthesis, as commonly used for stimuli manipulation in perceptual experiments.

This tutorial is of direct interest for researchers in prosody, phonetics, and speech analysis in general (e.g. pathological speech), and for everyone who manipulates the speech signal for perceptual experiments. Moreover, FDA techniques are of interest for the analysis of all time signals, including marker trajectories used in articulation and gesture analysis (like EMA), eye tracking studies, EEG, etc. Attendees should have basic knowledge of multi-dimensional statistics. No advanced mathematical skills are required. Since the reference FDA tool is written in R, basic experience with the R software environment will be beneficial.
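To make the idea concrete, here is a crude functional PCA sketch in Python: contours of different durations are linearly time-normalized to a common grid, and ordinary PCA is run on the resulting curves. A real FDA analysis (e.g. with the R 'fda' package referenced above) would instead use basis-function representations with smoothing penalties.

```python
import numpy as np

def functional_pca(contours, n_points=100, n_components=2):
    """Crude functional PCA on variable-length F0 contours.

    contours: list of 1-D arrays (e.g. F0 samples from Praat).
    Linear time normalization stands in for the registration and
    basis smoothing that a full FDA analysis would use.
    """
    grid = np.linspace(0.0, 1.0, n_points)
    X = np.vstack([np.interp(grid, np.linspace(0, 1, len(c)), c)
                   for c in contours])
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:n_components]        # principal curves ("PC curves")
    scores = (X - mean) @ components.T    # per-contour PC scores
    return mean, components, scores
```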
Short Bio: Michele Gubian obtained his Master in Telecommunication Engineering at Politecnico di Milano, Italy, in 2004, with a thesis on performance evaluation for ASR systems carried out at STMicroelectronics Labs in Milan. More information on Michele's research on FDA can be found on his website at: https://lands.let.ru.nl/FDA
Registers and Resonances in Singing

Joe Wolfe (University of New South Wales)
John Smith (University of New South Wales)
Maëva Garnier (GIPSA-lab Grenoble)
Abstract: Singing typically involves a much wider range of fundamental frequency and sound level than normal speech, and consequently samples a broader volume of parameter space. Furthermore, both the laryngeal mechanism and the resonance frequencies of the tract play important roles. Fortunately, singers are skilled at controlling vocal parameters and providing samples that can last several seconds. This tutorial, aimed at speech researchers, reviews the challenges of singing in the light of recent research, and gives an extended introduction to some of the techniques used to study laryngeal mechanisms and tract resonances in the singing voice.

To cover the six-octave range available to singing, three different laryngeal mechanisms are available. Mechanism M1 (men's normal or women's 'chest' voice) includes the vocalis muscle in vocal fold vibration. In M2 (falsetto or 'head' voice), only a surface layer of the folds vibrates. The M1-M2 transition occurs around 400 Hz (~G4) for both sexes and requires techniques, particularly by altos and tenors, to displace or to disguise it. The transition from M2 to M3 (whistle voice) falls typically around 1 kHz and is used by some sopranos for coloratura, pop and jazz. Considerable detail about the mechanism is obtained non-invasively by electroglottography, in which the high frequency electrical admittance is measured between pairs of electrodes placed on the neck at the level of the larynx. Nasendoscopy uses a camera (either high speed video or stroboscopic) to view the glottis from above, via a fibre optic cable inserted through the nose.

As in speech, the first two tract resonances, R1 and R2, carry phoneme information. In singing, they and the higher resonances can also give a useful boost to the radiated power when a voice harmonic falls near a resonance. The higher resonances R3 to R5 occur around a few kHz, where not only is hearing most sensitive, but orchestras are also less powerful: these are thought to contribute to the 'singer's formant', a band of enhanced power around 3 kHz. For some voice ranges, the wide spacing of harmonics, and the fact that f0 can exceed the usual values of R1, require resonance tuning strategies, such as tuning R1 to f0. As well as producing a power boost, this tuning may improve voice stability. Systematic R1:f0 tuning is widely observed in sopranos, and the upper limit of R1:f0 tuning often limits their range. Some, however, can then tune R2 to f0 and gain an additional octave or more using the M3 mechanism. The tutorial will discuss the various strategies used by different voice ranges that involve tuning f0 or one of its harmonics to R1 and/or R2. The tutorial will also introduce the technique of measuring the Ri precisely during singing, using broadband excitation at the mouth.
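The resonance-tuning idea can be illustrated with a small sketch: given a sung fundamental f0 and a set of measured tract resonances, check which voice harmonics fall close enough to a resonance to receive a radiated-power boost. This Python fragment is purely illustrative; the fixed 100 Hz proximity window is an arbitrary stand-in for a real resonance bandwidth.

```python
def resonance_boosts(f0, resonances, bandwidth=100.0):
    """For a sung note with fundamental f0 (Hz), report which voice
    harmonics lie within `bandwidth` Hz of each tract resonance and
    hence receive a power boost. Illustrative only."""
    boosts = {}
    for i, R in enumerate(resonances, start=1):
        n = max(1, round(R / f0))            # nearest harmonic number
        if abs(n * f0 - R) <= bandwidth:
            boosts[f"R{i}"] = (n, n * f0)
    return boosts

# A soprano singing C6 (~1047 Hz) with R1 tuned to f0:
print(resonance_boosts(1046.5, [1050.0, 2100.0, 3000.0]))
# {'R1': (1, 1046.5), 'R2': (2, 2093.0)} -- R3 gets no boost
```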
Short Bios: Maëva Garnier is a post-doctoral researcher at GIPSA-lab in Grenoble. In 2007, she obtained her Ph.D. in acoustic phonetics at the University of Paris 6, for her work on speech communication in noisy environments. From 2007 to 2010, she worked as a research associate in the Music Acoustics Group of the University of New South Wales in Sydney on vocal registers and vocal tract adjustments in speech and singing. In 2010, she started working at the Speech and Cognition Department of GIPSA-lab in Grenoble on the neural correlates of speech imitation and phonetic convergence.

John Smith is an associate professor of physics at the University of New South Wales. Originally his research was focused on the electrical properties of plant cell membranes, with some emphasis on their electrical impedance. In collaboration with his colleague, Joe Wolfe, he has used his experience with computer programming and interfacing to develop apparatus that can study the detailed acoustic properties of the vocal tract, musical instruments, and their interaction in performance.

Joe Wolfe is a professor of physics at the University of New South Wales (Sydney). He previously held postdoctoral positions at Cornell University (NY) and the CSIRO (Canberra), and an invited professorship at the École Normale Supérieure (Paris). Late last century, he and colleague John Smith established a laboratory to investigate musical instruments, including the voice. Among other tools, they developed techniques to study the roles of vocal tract resonances in the voice and in wind instrument performance. The team publishes regularly in high profile journals, and also maintains extensive web sites on music and physics, including voice acoustics, for the benefit of teachers and students.
Automatic Summarization

Ani Nenkova (Univ. of Pennsylvania)
Sameer Maskey (IBM Research)
Yang Liu (Univ. of Texas at Dallas)
Abstract: In the past decade, we have seen the amount of digital data, such as news, scientific articles, blogs, and conversations, increase at an exponential pace. The need to address 'information overload' by developing automatic summarization systems has never been more pressing. At the same time, approaches and algorithms for summarization have matured and increased in complexity, and interest in summarization research has intensified, with numerous publications on the topic each year. A newcomer to the field may find navigating the existing literature to be a daunting task. In this tutorial, we aim to give a systematic overview of traditional and more recent approaches for text and speech summarization.

A core problem in summarization research is devising methods to estimate the importance of a unit, be it a word, clause, sentence or utterance, in the input. A few classical methods will be introduced, but the overall emphasis will be on the most recent advances. We will cover the log-likelihood ratio test for topic word discovery and graph-based models for sentence importance, and will discuss semantically rich approaches based on latent semantic analysis and lexical resources. We will then turn to the most recent Bayesian models of summarization. For supervised machine learning approaches, we will discuss the suite of traditional features used in summarization, as well as issues with data annotation and acquisition. Ultimately, the summary will be a collection of important units. The summary can be selected in a greedy manner, choosing the most informative sentence one by one, or the units can be selected jointly and optimized for informativeness. We discuss both approaches, with emphasis on recent optimization work. In the part on evaluation, we will discuss the standard manual and automatic metrics for evaluation, as well as very recent work on fully automatic evaluation. We then turn to domain specific summarization, particularly summarization of scientific articles and speech data (telephone conversations, broadcast news, meetings and lectures). In speech, the acoustic signal brings more information that can be exploited as features in summarization, but it also poses unique problems, which we discuss, related to disfluencies, lack of sentence or clause boundaries, and recognition errors. We will only briefly touch on key but under-researched issues of linguistic quality of summaries, deeper semantic analysis for summarization, and abstractive summarization.

Outline:
1. Computing informativeness
2. Optimizing informativeness and minimizing redundancy
3. Evaluation
4. Domain specific summarization
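As a taste of the topic-word method mentioned above, here is a minimal Python implementation of Dunning's log-likelihood ratio test for a single word, comparing its rate in the input against a background corpus; the threshold of 10.83 is the conventional cut-off for p < 0.001 with one degree of freedom.

```python
import math

def llr_topic_word(k1, n1, k2, n2):
    """Dunning's log-likelihood ratio statistic for one word:
    k1 occurrences out of n1 tokens in the input to be summarized,
    k2 occurrences out of n2 tokens in a background corpus.
    Values above ~10.83 mark the word as a topic word."""
    def ll(k, n, p):                       # binomial log likelihood
        if p in (0.0, 1.0):                # degenerate rate: 0*log 0 -> 0
            return 0.0
        return k * math.log(p) + (n - k) * math.log(1 - p)
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)              # pooled rate (null hypothesis)
    return 2 * (ll(k1, n1, p1) + ll(k2, n2, p2)
                - ll(k1, n1, p) - ll(k2, n2, p))

print(llr_topic_word(k1=12, n1=800, k2=40, n2=1_000_000))
# large value, well above 10.83: a topic word
```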
Short Bios: Ani Nenkova is an Assistant Professor of Computer and Information Science at the University of Pennsylvania. She has worked extensively in the area of text summarization and evaluation of text summarization. She has recently developed fully automatic methods for the evaluation of both linguistic quality and content selection in summarization.

Sameer Maskey is a Research Staff Member at IBM Research in Yorktown Heights, New York. His main research interests are statistical techniques for natural language and speech processing, particularly machine translation and summarization of spoken documents. He has previously worked on other topics such as information extraction, speech synthesis, and question answering.

Yang Liu is an Assistant Professor of Computer Science at the University of Texas at Dallas. Her research interests are in a broad range of topics in speech and language processing, including summarization, spoken language understanding, prosody modeling in speech, emotion recognition, NLP for informal domains, and using speech and language technology for detection of communication disorders.
Sincerely,
Maurizio Omologo
Tutorials Chair