Tutorial M2

Building an Open Vocabulary ASR System using Open Source Software

Stefan Hahn (RWTH Aachen University, Computer Science Department)
David Rybach (RWTH Aachen University, Computer Science Department)
Abstract: Most speech recognition systems handle closed vocabularies only. For applications processing nearly unconstrained speech input, even huge vocabularies cannot cover all words. The open vocabulary approach presented here allows for the recognition of unknown words by incorporating sub-word units, for example word fragments, into the recognition process. These sub-word units may be merged together to form new words.

There is a growing number of publicly available open source software packages related to language and speech processing. Within this tutorial, all the necessary steps to build an automatic speech recognition (ASR) system with an open vocabulary for English will be presented, relying solely on open source software and publicly available data. Besides a review of the required theoretical background, the focus is on the practical aspects of building an open vocabulary speech recognition system from scratch, and on the potential pitfalls. The tutorial will use the RWTH Aachen University Open Source Speech Recognition Toolkit (RWTH ASR) for the development of acoustic models, as well as the included large vocabulary continuous speech recognition decoder. All tasks required to set up a baseline speech recognizer will be covered: starting from the configuration of the signal analysis, the estimation of Gaussian mixture models, phonetic decision trees, and speaker adaptation techniques are presented. Here, the focus is not so much on the theoretical aspects of these methods as on how to use them within RWTH ASR. The training of language models will be demonstrated using the SRI LM toolkit, which is used throughout the speech community. The Sequitur g2p toolkit will be applied to generate pronunciations for words not included in the pronunciation dictionary.

The English example task used throughout the tutorial is based only on publicly available data, in particular audio data and accompanying transcriptions published under an open source licence by Voxforge, and public domain books from Project Gutenberg. The size of the example task is deliberately kept small, such that it can be processed on a standard desktop computer in reasonable time. In addition to the trained statistical models and the data used, a collection of ready-to-use scripts and configuration files will be provided to the attendees of the tutorial, making it easy to port the system to different data and tasks. The main target audience is anyone interested in the practical aspects of building and understanding ASR systems. Knowledge of the theoretical background of basic ASR systems is assumed.
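To give a flavour of the fragment-merging idea, here is a minimal Python sketch. It is not part of RWTH ASR; the trailing '+' continuation marker is a hypothetical convention for how a decoder might tag sub-word units in its output.

```python
def merge_fragments(tokens):
    """Merge sub-word fragments (tagged with a trailing '+')
    into full word hypotheses.

    Hypothetical convention: 'dis+' 'similar+' 'ity' -> 'dissimilarity'.
    """
    words, buffer = [], ""
    for token in tokens:
        if token.endswith("+"):           # fragment continues into the next unit
            buffer += token[:-1]
        else:
            words.append(buffer + token)  # fragment sequence (if any) ends here
            buffer = ""
    if buffer:                            # dangling fragment at utterance end
        words.append(buffer)
    return words

print(merge_fragments(["the", "dis+", "similar+", "ity", "measure"]))
# ['the', 'dissimilarity', 'measure']
```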
Short Bios: Stefan Hahn studied computer science at RWTH Aachen University. He joined the Human Language Technology and Pattern Recognition Group headed by Prof. Dr.-Ing. Hermann Ney in 2004. In 2006 he received the Diploma degree in computer science at RWTH Aachen University. He is currently working at the Computer Science Department of RWTH Aachen University as a Ph.D. research assistant. His research interests include automatic speech recognition, log-linear modeling, spoken language understanding, and monotone string-to-string translation.

David Rybach received his Diploma degree in computer science in 2006 from RWTH Aachen University, Germany. In 2006 he joined the Human Language Technology and Pattern Recognition Group headed by Prof. Dr.-Ing. Hermann Ney as a Ph.D. research assistant. His main research interests lie in the area of automatic speech recognition. He maintains the RWTH Aachen University Open Source Speech Recognition Toolkit.
Learning with Rich Prior Knowledge

Joao Graca (University of Pennsylvania)
Gregory Druck (University of Massachusetts Amherst, Computer Science)
Kuzman Ganchev (Google Inc.)
Abstract: We possess a wealth of prior knowledge about most prediction problems, and particularly so for many of the fundamental tasks in speech processing and generation. For example, when learning letter-to-sound rules, even unlabeled words should obey some phonotactic constraints (such as 'each word should contain a vowel'). Unfortunately, it is often difficult to make use of this type of information during learning, as it typically does not come in the form of labeled examples, may be difficult to encode as a prior on parameters in a Bayesian setting, and may be impossible to incorporate into a tractable model. Instead, we usually have prior knowledge about the values of output variables. For example, we might know that several different speech recognizers should learn to agree on untranscribed data. Motivated by the prospect of being able to naturally leverage such knowledge, four different groups have recently developed similar, general frameworks for expressing and learning with side information about output variables. These frameworks are Constraint-Driven Learning (UIUC), Posterior Regularization (UPenn), Generalized Expectation Criteria (UMass Amherst), and Learning from Measurements (UC Berkeley). The tutorial will provide the audience with the theoretical background to understand why these methods have been so effective, as well as practical guidance on how to apply them. Specifically, we discuss issues that come up in implementation, and describe a toolkit that provides 'out-of-the-box' support for the applications described in the tutorial, and is extensible to other applications and new types of prior knowledge.
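As an illustration of learning with an expectation constraint, the sketch below adds a GE-style penalty to binary logistic regression in plain numpy: the mean predicted positive rate on unlabeled data is pushed towards a prior target. This is a toy simplification (a squared-distance penalty rather than the KL-based criteria these frameworks actually use), not code from the toolkit mentioned above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_with_expectation_penalty(Xl, yl, Xu, target=0.3, lam=10.0,
                                 lr=0.1, steps=500):
    """Binary logistic regression with a GE-style penalty that pushes
    the mean predicted positive rate on unlabeled data Xu towards a
    prior belief `target` (toy squared-distance version)."""
    w = np.zeros(Xl.shape[1])
    for _ in range(steps):
        pl = sigmoid(Xl @ w)                  # labeled predictions
        pu = sigmoid(Xu @ w)                  # unlabeled predictions
        grad = Xl.T @ (pl - yl) / len(Xl)     # labeled NLL gradient
        # gradient of lam * (mean(pu) - target)^2, using
        # d mean(pu)/dw = mean over Xu of pu*(1-pu)*x
        grad += 2 * lam * (pu.mean() - target) * \
                (Xu.T @ (pu * (1 - pu))) / len(Xu)
        w -= lr * grad
    return w

# toy usage with random data
rng = np.random.default_rng(0)
Xl = rng.normal(size=(20, 3)); yl = (Xl[:, 0] > 0).astype(float)
Xu = rng.normal(size=(200, 3))
w = fit_with_expectation_penalty(Xl, yl, Xu, target=0.5)
```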
Short Bios: Joao Graca (joao.graca@l2f.inesc-id.pt), Gregory Druck (gdruck@cs.umass.edu), Kuzman Ganchev
Blind Speech Separation based on Independent Component Analysis and Sparse Component Analysis

Shoji Makino (University of Tsukuba)
Hiroshi Sawada (NTT Communication Science Laboratories)
Abstract: This tutorial describes a state-of-the-art method for the blind source separation (BSS) of convolutive mixtures of audio signals. Independent component analysis (ICA) is used as a major statistical tool for separating the mixtures. We provide examples to show how ICA criteria change as the number of audio sources increases. We then discuss a frequency-domain approach where simple instantaneous ICA is employed in each frequency bin. A directivity pattern analysis of the ICA solutions provides us with a physical interpretation of the ICA-based separation. It tells us the relationship between ICA-based BSS and adaptive beamforming. In order to obtain properly separated signals with the frequency-domain approach, the permutation and scaling ambiguity of the ICA solutions should be aligned appropriately. We describe two complementary methods for aligning the permutations, i.e., collecting separated frequency components originating from the same source. The first method exploits the signal envelope dependence of the same source across frequencies. The second method relies on the spatial diversity of the sources, and is closely related to source localization techniques. Finally, we describe methods for sparse source separation, which can be applied even to an underdetermined case. The tutorial will end with a live demonstration of BSS in a real room situation.
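The envelope-based permutation alignment can be sketched compactly. The following Python fragment is a minimal two-source illustration of the first method (signal envelope dependence across frequencies), assuming the per-bin ICA outputs are already available; it is not the tutorial's actual implementation.

```python
import numpy as np

def align_permutations(Y):
    """Align source permutations across frequency bins by envelope
    correlation (minimal two-source sketch).

    Y: complex array of shape (n_bins, 2, n_frames) holding the two
       ICA outputs of every frequency bin.
    Returns Y with each bin permuted so that output k follows the
    same source in every bin.
    """
    env = np.abs(Y)                        # amplitude envelopes
    ref = env[0].copy()                    # running reference envelopes
    for f in range(1, Y.shape[0]):
        # total correlation with the reference for both assignments
        keep = np.corrcoef(env[f, 0], ref[0])[0, 1] + \
               np.corrcoef(env[f, 1], ref[1])[0, 1]
        swap = np.corrcoef(env[f, 0], ref[1])[0, 1] + \
               np.corrcoef(env[f, 1], ref[0])[0, 1]
        if swap > keep:                    # swapped assignment fits better
            Y[f] = Y[f, ::-1]
            env[f] = env[f, ::-1]
        ref = 0.9 * ref + 0.1 * env[f]     # smooth reference update
    return Y
```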
Short Bios: Shoji Makino received B.E., M.E., and Ph.D. degrees from Tohoku University, Japan, in 1979, 1981, and 1993, respectively. He joined NTT in 1981. He is now a Professor at the University of Tsukuba. His research interests include adaptive filtering technologies, the realization of acoustic echo cancellation, blind source separation of convolutive mixtures of speech, and acoustic signal processing for speech and audio applications. He received the ICA Unsupervised Learning Pioneer Award in 2006, the IEEE MLSP Competition Award in 2007, the TELECOM System Technology Award in 2004, the Achievement Award of the Institute of Electronics, Information, and Communication Engineers (IEICE) in 1997, the Outstanding Technological Development Award of the Acoustical Society of Japan (ASJ) in 1995, the Paper Award of the IEICE in 2005 and 2002, and the Paper Award of the ASJ in 2005 and 2002. He is the author or co-author of more than 200 articles in journals and conference proceedings and is responsible for more than 150 patents. He was a Keynote Speaker at ICA2007 and a Tutorial speaker at ICASSP2007. He has served on the IEEE SPS Awards Board (2006-08) and the IEEE SPS Conference Board (2002-04). He is a member of the James L. Flanagan Speech & Audio Processing Award Committee. He was an Associate Editor of the IEEE Transactions on Speech and Audio Processing (2002-05) and is an Associate Editor of the EURASIP Journal on Advances in Signal Processing. He is a member of the SPS Audio and Acoustics Signal Processing Technical Committee and the Chair of the Blind Signal Processing Technical Committee of the IEEE Circuits and Systems Society. He was the Vice President of the Engineering Sciences Society of the IEICE (2007-08) and the Chair of the Engineering Acoustics Technical Committee of the IEICE (2006-08). He is a member of the International IWAENC Standing Committee and a member of the International ICA Steering Committee. He was the General Chair of WASPAA2007, the General Chair of IWAENC2003, the Organizing Chair of ICA2003, and is the designated Plenary Chair of ICASSP2012. Dr. Makino is an IEEE SPS Distinguished Lecturer (2009-10), an IEEE Fellow, an IEICE Fellow, a council member of the ASJ, and a member of EURASIP.

Hiroshi Sawada received the B.E., M.E., and Ph.D. degrees in information science from Kyoto University, Kyoto, Japan, in 1991, 1993, and 2001, respectively. He joined NTT Corporation in 1993. He is now the Group Leader of the Learning and Intelligent Systems Research Group at the NTT Communication Science Laboratories, Kyoto, Japan. His research interests include statistical signal processing, audio source separation, array signal processing, machine learning, latent variable models, graph-based data structures, and computer architecture. From 2006 to 2009, he served as an associate editor of the IEEE Transactions on Audio, Speech and Language Processing. He is a member of the Audio and Acoustic Signal Processing Technical Committee of the IEEE Signal Processing Society. He received the Ninth TELECOM System Technology Award for Student from the Telecommunications Advancement Foundation in 1994, the Best Paper Award of the IEEE Circuits and Systems Society in 2000, and the MLSP Data Analysis Competition Award in 2007. He was a Tutorial speaker at ICASSP2007. He served as a member of the Evaluation Organizing Committee (Audio Committee) of SiSEC 2010 (Signal Separation Evaluation Campaign) featured at LVA/ICA 2010.
Functional Data Analysis for Speech Research

Michele Gubian (Centre for Language and Speech Technology, Nijmegen, the Netherlands)
Abstract: The analysis of the speech signal often requires dealing with data in the form of contours in time, like F0, formants, intensity, etc. Contour analysis usually boils down to the (manual) identification of a set of 'important points', like peaks, valleys, and elbows, whose coordinates serve as numerical shape descriptors. Those descriptors are a suitable input for classic statistical tools like Principal Component Analysis (PCA), linear regression, and ANOVA. This approach incurs several problems. First, one needs to decide in advance which shape traits are relevant and which are not. For example, by deciding to describe contours only in terms of peak/valley points, one implicitly excludes the possibility that aspects like concavity/convexity play a role. Second, the identification of peaks and valleys can be difficult and potentially ambiguous (for example, an intonational *H target may be physically realized as a plateau in an F0 contour). Finally, the identification of the reference points often depends on the judgement of trained listeners, which introduces the risk of bias, issues with respect to inter-annotator agreement, and high costs.

The purpose of this tutorial is to introduce Functional Data Analysis (FDA) as a solution to the aforementioned problems. FDA is a suite of statistical tools introduced in the 90s by J. Ramsay and colleagues. FDA modifies well-known techniques like PCA and linear regression in such a way that they can take whole contours (functions of time) as input variables, as opposed to fixed-length vectors of numbers. In this way, all the information contained in the shape of contours is preserved and used in the analysis. As a result, the intermediate step of selecting the relevant shape traits and (manually) performing the required measurements is eliminated.

In this tutorial, I will show how to carry out FDA on a set of contours, mainly using F0 and formants as examples. I will show how to obtain a functional representation of sampled contours (e.g. F0 from Praat), how to deal with signals of different duration, how to apply functional PCA (FPCA), and how to interpret the graphical and numerical results. I will then turn to more sophisticated approaches, such as the analysis of more than one feature at once (e.g. F0 and intensity, or F0 and speech rate). Finally, I will show the use of FPCA as an exploration tool for speech re-synthesis, as commonly used for stimuli manipulation in perceptual experiments.

This tutorial is of direct interest for researchers in prosody, phonetics, and speech analysis in general (e.g. pathological speech), and for everyone who manipulates the speech signal for perceptual experiments. Moreover, FDA techniques are of interest for the analysis of all time signals, including marker trajectories used in articulation and gesture analysis (like EMA), eye tracking studies, EEG, etc. Attendees should have basic knowledge of multi-dimensional statistics. No advanced mathematical skills are required. Since the reference FDA tool is written in R, basic experience with the R software environment will be beneficial.
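To make the idea concrete, here is a crude functional PCA sketch in Python: contours of different durations are linearly time-normalized to a common grid, and ordinary PCA is run on the resulting curves. A real FDA analysis (e.g. with the R 'fda' package referenced above) would instead use basis-function representations with smoothing penalties.

```python
import numpy as np

def functional_pca(contours, n_points=100, n_components=2):
    """Crude functional PCA on variable-length F0 contours.

    contours: list of 1-D arrays (e.g. F0 samples from Praat).
    Linear time normalization stands in for the registration and
    basis smoothing that a full FDA analysis would use.
    """
    grid = np.linspace(0.0, 1.0, n_points)
    X = np.vstack([np.interp(grid, np.linspace(0, 1, len(c)), c)
                   for c in contours])
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:n_components]        # principal curves ("PC curves")
    scores = (X - mean) @ components.T    # per-contour PC scores
    return mean, components, scores
```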
Short Bio: Michele Gubian obtained his Master in Telecommunication Engineering at Politecnico di Milano, Italy, in 2004, with a thesis on performance evaluation for ASR systems carried out at STMicroelectronics Labs in Milan. More information on Michele's research on FDA can be found on his website at: https://lands.let.ru.nl/FDA
Registers and Resonances in Singing

Joe Wolfe (University of New South Wales)
John Smith (University of New South Wales)
Maëva Garnier (GIPSA-lab Grenoble)
Abstract: Singing typically involves a much wider range of fundamental frequency and sound level than normal speech, and consequently samples a broader volume of parameter space. Furthermore, both the laryngeal mechanism and the resonance frequencies of the tract play important roles. Fortunately, singers are skilled at controlling vocal parameters and providing samples that can last several seconds. This tutorial, aimed at speech researchers, reviews the challenges of singing in the light of recent research, and gives an extended introduction to some of the techniques used to study laryngeal mechanisms and tract resonances in the singing voice.

To cover the six-octave range available to singing, three different laryngeal mechanisms are available. Mechanism M1 (men's normal or women's 'chest' voice) includes the vocalis muscle in vocal fold vibration. In M2 (falsetto or 'head' voice), only a surface layer of the folds vibrates. The M1-M2 transition occurs around 400 Hz (~G4) for both sexes and requires techniques, particularly by altos and tenors, to displace or to disguise it. The transition from M2 to M3 (whistle voice) falls typically around 1 kHz and is used by some sopranos for coloratura, pop and jazz. Considerable detail about the mechanism is obtained non-invasively by electroglottography, in which the high frequency electrical admittance is measured between pairs of electrodes placed on the neck at the level of the larynx. Nasendoscopy uses a camera (either high speed video or stroboscopic) to view the glottis from above, via a fibre optic cable inserted through the nose.

As in speech, the first two tract resonances, R1 and R2, carry phoneme information. In singing, they and the higher resonances can also give a useful boost to the radiated power when a voice harmonic falls near a resonance. The higher resonances R3 to R5 occur around a few kHz, where not only is hearing most sensitive, but orchestras are also less powerful: these are thought to contribute to the 'singer's formant', a band of enhanced power around 3 kHz. For some voice ranges, the wide spacing of harmonics, and the fact that f0 can exceed the usual values of R1, require resonance tuning strategies, such as tuning R1 to f0. As well as producing a power boost, this tuning may improve voice stability. Systematic R1:f0 tuning is widely observed in sopranos, and the upper limit of R1:f0 tuning often limits their range. Some, however, can then tune R2 to f0 and gain an additional octave or more using the M3 mechanism. The tutorial will discuss the various strategies used by different voice ranges that involve tuning f0 or one of its harmonics to R1 and/or R2. The tutorial will also introduce the technique of measuring the Ri precisely during singing, using broadband excitation at the mouth.
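The resonance-tuning idea can be illustrated with a small sketch: given a sung fundamental f0 and a set of measured tract resonances, check which voice harmonics fall close enough to a resonance to receive a radiated-power boost. This Python fragment is purely illustrative; the fixed 100 Hz proximity window is an arbitrary stand-in for a real resonance bandwidth.

```python
def resonance_boosts(f0, resonances, bandwidth=100.0):
    """For a sung note with fundamental f0 (Hz), report which voice
    harmonics lie within `bandwidth` Hz of each tract resonance and
    hence receive a power boost. Illustrative only."""
    boosts = {}
    for i, R in enumerate(resonances, start=1):
        n = max(1, round(R / f0))            # nearest harmonic number
        if abs(n * f0 - R) <= bandwidth:
            boosts[f"R{i}"] = (n, n * f0)
    return boosts

# A soprano singing C6 (~1047 Hz) with R1 tuned to f0:
print(resonance_boosts(1046.5, [1050.0, 2100.0, 3000.0]))
# {'R1': (1, 1046.5), 'R2': (2, 2093.0)} -- R3 gets no boost
```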
Short Bios: Maëva Garnier is a post-doctoral researcher at GIPSA-lab in Grenoble. In 2007, she obtained her Ph.D. in acoustic phonetics at the University of Paris 6, for her work on speech communication in noisy environments. From 2007 to 2010, she worked as a research associate in the Music Acoustics Group of the University of New South Wales in Sydney on vocal registers and vocal tract adjustments in speech and singing. In 2010, she started working at the Speech and Cognition Department of GIPSA-lab in Grenoble on the neural correlates of speech imitation and phonetic convergence.

John Smith is an associate professor of physics at the University of New South Wales. Originally his research was focused on the electrical properties of plant cell membranes, with some emphasis on their electrical impedance. In collaboration with his colleague, Joe Wolfe, he has used his experience with computer programming and interfacing to develop apparatus that can study the detailed acoustic properties of the vocal tract, musical instruments, and their interaction in performance.

Joe Wolfe is a professor of physics at the University of New South Wales (Sydney). He previously held postdoctoral positions at Cornell University (NY) and the CSIRO (Canberra), and an invited professorship at the École Normale Supérieure (Paris). Late last century, he and colleague John Smith established a laboratory to investigate musical instruments, including the voice. Among other tools, they developed techniques to study the roles of vocal tract resonances in the voice and in wind instrument performance. The team publishes regularly in high profile journals, and also maintains extensive web sites on music and physics, including voice acoustics, for the benefit of teachers and students.
Automatic Summarization

Ani Nenkova (Univ. of Pennsylvania)
Sameer Maskey (IBM Research)
Yang Liu (Univ. of Texas at Dallas)
Abstract: In the past decade, we have seen the amount of digital data, such as news, scientific articles, blogs, and conversations, increase at an exponential pace. The need to address 'information overload' by developing automatic summarization systems has never been more pressing. At the same time, approaches and algorithms for summarization have matured and increased in complexity, and interest in summarization research has intensified, with numerous publications on the topic each year. A newcomer to the field may find navigating the existing literature to be a daunting task. In this tutorial, we aim to give a systematic overview of traditional and more recent approaches for text and speech summarization.

A core problem in summarization research is devising methods to estimate the importance of a unit, be it a word, clause, sentence or utterance, in the input. A few classical methods will be introduced, but the overall emphasis will be on the most recent advances. We will cover the log-likelihood ratio test for topic word discovery and graph-based models for sentence importance, and will discuss semantically rich approaches based on latent semantic analysis and lexical resources. We will then turn to the most recent Bayesian models of summarization. For supervised machine learning approaches, we will discuss the suite of traditional features used in summarization, as well as issues with data annotation and acquisition. Ultimately, the summary will be a collection of important units. The summary can be selected in a greedy manner, choosing the most informative sentence one by one, or the units can be selected jointly and optimized for informativeness. We discuss both approaches, with emphasis on recent optimization work. In the part on evaluation, we will discuss the standard manual and automatic metrics for evaluation, as well as very recent work on fully automatic evaluation. We then turn to domain specific summarization, particularly summarization of scientific articles and speech data (telephone conversations, broadcast news, meetings and lectures). In speech, the acoustic signal brings more information that can be exploited as features in summarization, but it also poses unique problems, which we discuss, related to disfluencies, lack of sentence or clause boundaries, and recognition errors. We will only briefly touch on key but under-researched issues of linguistic quality of summaries, deeper semantic analysis for summarization, and abstractive summarization.

Outline:
1. Computing informativeness
2. Optimizing informativeness and minimizing redundancy
3. Evaluation
4. Domain specific summarization
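As a taste of the topic-word method mentioned above, here is a minimal Python implementation of Dunning's log-likelihood ratio test for a single word, comparing its rate in the input against a background corpus; the threshold of 10.83 is the conventional cut-off for p < 0.001 with one degree of freedom.

```python
import math

def llr_topic_word(k1, n1, k2, n2):
    """Dunning's log-likelihood ratio statistic for one word:
    k1 occurrences out of n1 tokens in the input to be summarized,
    k2 occurrences out of n2 tokens in a background corpus.
    Values above ~10.83 mark the word as a topic word."""
    def ll(k, n, p):                       # binomial log likelihood
        if p in (0.0, 1.0):                # degenerate rate: 0*log 0 -> 0
            return 0.0
        return k * math.log(p) + (n - k) * math.log(1 - p)
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)              # pooled rate (null hypothesis)
    return 2 * (ll(k1, n1, p1) + ll(k2, n2, p2)
                - ll(k1, n1, p) - ll(k2, n2, p))

print(llr_topic_word(k1=12, n1=800, k2=40, n2=1_000_000))
# large value, well above 10.83: a topic word
```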
Short Bios: Ani Nenkova is an Assistant Professor of Computer and Information Science at the University of Pennsylvania. She has worked extensively in the area of text summarization and evaluation of text summarization. She has recently developed fully automatic methods for the evaluation of both linguistic quality and content selection in summarization.

Sameer Maskey is a Research Staff Member at IBM Research in Yorktown Heights, New York. His main research interests are statistical techniques for natural language and speech processing, particularly machine translation and summarization of spoken documents. He has previously worked on other topics such as information extraction, speech synthesis, and question answering.

Yang Liu is an Assistant Professor of Computer Science at the University of Texas at Dallas. Her research interests are in a broad range of topics in speech and language processing, including summarization, spoken language understanding, prosody modeling in speech, emotion recognition, NLP for informal domains, and using speech and language technology for detection of communication disorders.
Sincerely,
Maurizio Omologo
Tutorials Chair