12th Annual Conference of the
International Speech Communication Association
Interspeech 2011 Florence
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Wed-Ses1-P3: Language, Dialect Identification and Speaker Diarization
Time: Wednesday 10:00
Place: Faenza 1 - Pala Congressi (Passi Perduti-Gallery)
Type: Poster
Chair: Nancy Chen
#1 | Study on the Relevance Factor of Maximum a Posteriori with GMM for Language Recognition
Chang Huai You (Institute for Infocomm Research, Singapore) Haizhou Li (Institute for Infocomm Research, Singapore) Kong Aik Lee (Institute for Infocomm Research, Singapore)
In this paper, the relevance factor in maximum a posteriori (MAP) adaptation of a Gaussian mixture model (GMM) from a universal background model (UBM) is studied for language recognition. In conventional MAP, the relevance factor is typically set empirically to a constant. Since the relevance factor determines how much the observed training data influence the model adaptation, and thus the resulting GMM models, we believe it should depend on the data for more effective modeling. We formulate the estimation of the relevance factor in a systematic manner and study its role in characterizing spoken languages with supervectors. We use a Bhattacharyya-based language recognition system on the National Institute of Standards and Technology (NIST) language recognition evaluation (LRE) 2009 task to investigate the validity of the data-dependent relevance factor. Experimental results show that the proposed relevance factor yields improved performance.
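As a rough illustration of the MAP adaptation step the abstract refers to, the sketch below performs conventional means-only MAP adaptation of a GMM from a UBM, with the relevance factor exposed as a parameter that may be a scalar (the usual constant) or a per-component array (a stand-in for a data-dependent choice). The paper's actual estimator for the data-dependent relevance factor is not reproduced; all names and shapes are illustrative.

    import numpy as np

    def map_adapt_means(ubm_means, ubm_weights, ubm_covars, frames, relevance):
        # frames: (T, D); ubm_means and diagonal ubm_covars: (M, D); ubm_weights: (M,).
        # relevance: a scalar (conventional MAP) or a per-component array of shape (M,).
        diff = frames[:, None, :] - ubm_means[None, :, :]                      # (T, M, D)
        log_gauss = -0.5 * np.sum(diff ** 2 / ubm_covars
                                  + np.log(2 * np.pi * ubm_covars), axis=2)    # (T, M)
        log_post = np.log(ubm_weights) + log_gauss
        log_post -= log_post.max(axis=1, keepdims=True)
        gamma = np.exp(log_post)
        gamma /= gamma.sum(axis=1, keepdims=True)                              # frame posteriors
        n = gamma.sum(axis=0)                                                  # soft counts n_i
        ex = gamma.T @ frames / np.maximum(n[:, None], 1e-10)                  # first-order stats E_i[x]
        alpha = n / (n + relevance)                                            # adaptation coefficients
        return alpha[:, None] * ex + (1 - alpha)[:, None] * ubm_means          # adapted means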
|
#2 | Improving Multiband Position Pitch Algorithm for Localization and Tracking of Multiple Concurrent Speakers by using a Frequency Selective Criterion
Tania Habib (Signal Processing and Speech Communication Lab, Graz University of Technology, Austria) Harald Romsdorfer (Signal Processing and Speech Communication Lab, Graz University of Technology, Austria)
We present an auditory-inspired frequency-selective extension to the multiband position-pitch (MPoPi) algorithm and a new particle filtering algorithm for localization and tracking of an arbitrary number of concurrent speakers. In the particle filtering framework, we combine standard bootstrap with importance sampling techniques. The proposed algorithm was tested on real-world recordings from a 24-channel microphone array in a meeting room for different location and speaker combinations. The results show that the frequency-selective criterion outperforms both state-of-the-art methods and our original algorithm.
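The localization front end and the handling of multiple simultaneous speakers are beyond a short sketch, but the basic bootstrap particle filtering loop the abstract builds on can be illustrated as follows for a single 2-D source. The likelihood callback (a stand-in for a position-pitch or steered-response pseudo-likelihood), the random-walk motion model, and all parameters are assumptions.

    import numpy as np

    def bootstrap_particle_filter(observations, likelihood, n_particles=500,
                                  room=((-3.0, 3.0), (-3.0, 3.0)), motion_std=0.1, seed=0):
        # Tracks a single 2-D source position; likelihood(obs, particles) must return
        # one non-negative weight per hypothesised position.
        rng = np.random.default_rng(seed)
        lo = np.array([r[0] for r in room])
        hi = np.array([r[1] for r in room])
        particles = rng.uniform(lo, hi, size=(n_particles, 2))
        estimates = []
        for obs in observations:
            # Predict: random-walk motion model.
            particles = particles + rng.normal(0.0, motion_std, particles.shape)
            # Update: weight particles by the observation likelihood.
            w = np.maximum(likelihood(obs, particles), 1e-12)
            w /= w.sum()
            estimates.append(w @ particles)                 # weighted-mean position estimate
            # Resample to avoid weight degeneracy.
            particles = particles[rng.choice(n_particles, size=n_particles, p=w)]
        return np.array(estimates)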
|
#3 | On the Use of Lattices of Time-Synchronous Cross-Decoder Phone Co-occurrences in a SVM-Phonotactic Language Recognition System
Amparo Varona (University of the Basque Country) Mikel Penagarikano (University of the Basque Country) Luis Javier Rodriguez-Fuentes (University of the Basque Country) German Bordel (University of the Basque Country)
This paper presents a simple approach to phonotactic language recognition that uses lattices of time-synchronous cross-decoder phone co-occurrences. In previous work, we successfully applied cross-decoder information, but using n-gram statistics extracted from 1-best phone strings. In this work, the method to build and properly use lattices of cross-decoder phone co-occurrences is fully explained and developed. Experiments were carried out on the 2007 NIST LRE database.
The proposed approach outperformed the baseline phonotactic systems for both 3-grams and 4-grams. The best results were obtained by considering the m=400 most likely cross-decoder co-occurrences: 1.29% EER and CLLR=0.203.
The fusion of the baseline system with the proposed approach yielded 1.22% EER and CLLR=0.203 (18% and 15% relative improvements) for n=3, and 1.17% EER and CLLR=0.197 (15% and 10% relative improvements) for n=4, outperforming state-of-the-art phonotactic systems on the same task.
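For orientation only, the kind of baseline phonotactic SVM system used for comparison can be sketched as below: relative-frequency n-gram counts computed from decoded phone strings and fed to a linear SVM. The lattice-based cross-decoder co-occurrence features that are the paper's contribution are not reproduced, and the phone strings and labels are toy placeholders.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer
    from sklearn.svm import LinearSVC

    # Each training example is one utterance's decoded phone string (toy data).
    phone_strings = ["ah b k ah b ih", "b k d ih b k", "ah ih d ah b d", "d b k d ih ah"]
    languages = ["lang1", "lang2", "lang1", "lang2"]

    clf = make_pipeline(
        CountVectorizer(analyzer="word", token_pattern=r"\S+", ngram_range=(1, 3)),  # phone 1- to 3-grams
        Normalizer(norm="l1"),                                                        # relative frequencies
        LinearSVC(C=1.0),
    )
    clf.fit(phone_strings, languages)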
|
#4 | Speaker Clustering Based on Utterance-oriented Dirichlet Process Mixture Model
Naohiro Tawara (Department of Science and Engineering, Waseda University) Shinji Watanabe (NTT Communication Science Laboratories, NTT Corporation) Tetsuji Ogawa (Waseda Institute for Advanced Study) Tetsunori Kobayashi (Department of Science and Engineering, Waseda University)
This paper provides the analytical solution and algorithm for the utterance-oriented Dirichlet process mixture model (UO-DPMM) in a non-parametric Bayesian manner, and thus realizes fully Bayesian speaker clustering. We carried out preliminary speaker clustering experiments on the TIMIT database to compare the proposed method with the conventional Bayesian Information Criterion (BIC) based method, which is an approximate Bayesian approach. The results showed that the proposed method outperformed the conventional one in terms of both computational cost and robustness to changes in tuning parameters.
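As a heavily simplified, generic illustration of Dirichlet process mixture clustering (not the paper's utterance-oriented model or its analytical solution), the following collapsed Gibbs sampler assigns one vector per utterance to clusters under a spherical-Gaussian likelihood with a Chinese restaurant process prior; the hyperparameters and the likelihood choice are assumptions.

    import numpy as np

    def dpmm_cluster_utterances(X, alpha=1.0, sigma2=1.0, tau2=4.0, n_iters=50, seed=0):
        # X: (N, D), one vector per utterance; returns one cluster label per utterance.
        rng = np.random.default_rng(seed)
        N, D = X.shape
        labels = np.zeros(N, dtype=int)                      # start with a single cluster

        def log_predictive(x, members):
            # Posterior predictive of x for the cluster containing `members`, given a
            # N(0, tau2*I) prior on the cluster mean and known observation variance sigma2*I.
            n = len(members)
            prec = 1.0 / tau2 + n / sigma2
            mean = (X[members].sum(axis=0) / sigma2) / prec
            var = sigma2 + 1.0 / prec
            return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))

        for _ in range(n_iters):
            for i in range(N):
                labels[i] = -1                               # remove utterance i from its cluster
                existing = np.unique(labels[labels >= 0])
                clusters = [np.where(labels == k)[0] for k in existing]
                # CRP prior: existing clusters in proportion to their size, a new cluster to alpha.
                logp = [np.log(len(c)) + log_predictive(X[i], c) for c in clusters]
                logp.append(np.log(alpha) + log_predictive(X[i], np.array([], dtype=int)))
                logp = np.asarray(logp)
                p = np.exp(logp - logp.max())
                p /= p.sum()
                choice = rng.choice(len(p), p=p)
                labels[i] = existing[choice] if choice < len(existing) else labels.max() + 1
        return labels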
|
#5 | PLDA-based Clustering for Speaker Diarization of Broadcast Streams
Jan Silovsky (Institute of Information Technology and Electronics, Faculty of Mechatronics, Technical University of Liberec, Czech Republic) Jan Prazak (Institute of Information Technology and Electronics, Faculty of Mechatronics, Technical University of Liberec, Czech Republic) Petr Cerva (Institute of Information Technology and Electronics, Faculty of Mechatronics, Technical University of Liberec, Czech Republic) Jindrich Zdansky (Institute of Information Technology and Electronics, Faculty of Mechatronics, Technical University of Liberec, Czech Republic) Jan Nouza (Institute of Information Technology and Electronics, Faculty of Mechatronics, Technical University of Liberec, Czech Republic)
This paper presents two approaches to speaker clustering based on Probabilistic Linear Discriminant Analysis (PLDA) for the speaker diarization task. We refer to them as the multifold-PLDA and onefold-PLDA approaches. In both, a simple factor analysis model is employed to extract a low-dimensional representation of a sequence of acoustic feature vectors, the so-called i-vector, and these i-vectors are modeled with PLDA. Further, a two-stage clustering scheme is examined, with a Bayesian Information Criterion (BIC) based approach applied in the first stage and the PLDA-based approach in the second. We carried out our experiments using the COST278 multilingual broadcast news database. The best evaluated system yielded a 42% relative improvement in speaker error rate over a baseline BIC-based system.
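As a sketch of the clustering stage only, the snippet below runs a generic agglomerative clustering over a precomputed matrix of pairwise similarity scores between segments, such as PLDA log-likelihood ratios between segment i-vectors. The i-vector extraction, the PLDA scoring itself, and the paper's two-stage BIC-then-PLDA scheme are not reproduced, and score and threshold are assumed inputs.

    import numpy as np

    def cluster_segments(score, threshold):
        # score: (S, S) symmetric matrix of pairwise segment similarities; higher = same speaker.
        clusters = [[i] for i in range(score.shape[0])]
        while len(clusters) > 1:
            best, best_pair = -np.inf, None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    # Average linkage over all cross-cluster segment pairs.
                    s = np.mean([score[i, j] for i in clusters[a] for j in clusters[b]])
                    if s > best:
                        best, best_pair = s, (a, b)
            if best < threshold:                             # stop when no pair looks like one speaker
                break
            a, b = best_pair
            clusters[a] += clusters[b]
            del clusters[b]
        labels = np.empty(score.shape[0], dtype=int)
        for k, members in enumerate(clusters):
            labels[members] = k
        return labels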
|
#6 | iVector Approach to Phonotactic Language Recognition
Mehdi Soufifar (Department of Electronics and Telecommunications, NTNU, Trondheim, Norway) Marcel Kockmann (Brno University of Technology, Speech@FIT, Czech Republic) Lukas Burget (Brno University of Technology, Speech@FIT, Czech Republic) Olda Plchot (Brno University of Technology, Speech@FIT, Czech Republic) Ondrej Glembek (Brno University of Technology, Speech@FIT, Czech Republic) Torbjørn Svendsen (Department of Electronics and Telecommunications, NTNU, Trondheim, Norway)
This paper presents a novel technique for the representation and processing of n-gram counts in phonotactic language recognition (LRE): subspace multinomial modelling represents vectors of n-gram counts by low-dimensional vectors of coordinates in a total variability subspace, called iVectors. Two techniques for iVector scoring are tested: support vector machines (SVM) and logistic regression (LR). Using the standard NIST LRE 2009 task as our evaluation set, the latter scoring approach was shown to outperform a phonotactic LRE system based on direct SVM classification of n-gram count vectors. The proposed iVector paradigm also shows results comparable to previously proposed PCA-based phonotactic feature extraction.
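For the scoring stage only, a minimal sketch of multiclass logistic regression over already-extracted iVectors might look like the following; the subspace multinomial iVector extraction that is the paper's core contribution is not reproduced, and the arrays below are random stand-ins for real iVectors and language labels.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    train_ivectors = rng.normal(size=(200, 600))        # stand-in for 600-dimensional iVectors
    train_langs = rng.integers(0, 5, size=200)          # stand-in for language labels

    lr = LogisticRegression(max_iter=1000)
    lr.fit(train_ivectors, train_langs)
    scores = lr.predict_log_proba(rng.normal(size=(10, 600)))   # per-language log-posteriors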
|
#7 | Discriminative Features For Language Identification
Christopher Alberti (Google Inc.) Michiel Bacchiani (Google Inc.)
In this paper we investigate the use of discriminatively trained feature transforms to improve the accuracy of a MAP-SVM language recognition system. We train the feature transforms by alternately solving an SVM optimization on MAP supervectors estimated from transformed features and performing a small step on the transforms in the direction of the antigradient of the SVM objective function. We applied this method to the LRE2003 dataset and obtained a 5.9% relative reduction in pooled equal error rate.
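A toy version of this alternation might look like the sketch below: a linear SVM is refit on utterance-level vectors built from transformed features, then a small step is taken on the transform along the antigradient of the hinge-loss term. Mean-pooled transformed frames stand in for the MAP supervectors, labels are assumed binary (+1/-1), and the step size and shapes are arbitrary, so this illustrates the alternation pattern rather than the paper's exact gradient.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_discriminative_transform(utt_frames, labels, dim_out, n_iters=10, step=1e-3, C=1.0, seed=0):
        # utt_frames: list of (T_u, dim_in) frame arrays; labels: +1/-1 per utterance.
        rng = np.random.default_rng(seed)
        dim_in = utt_frames[0].shape[1]
        A = rng.normal(scale=0.1, size=(dim_out, dim_in))               # linear feature transform
        y = np.asarray(labels, dtype=float)
        xbar = np.stack([f.mean(axis=0) for f in utt_frames])           # (U, dim_in)
        for _ in range(n_iters):
            S = xbar @ A.T                               # utterance-level vectors from transformed features
            svm = LinearSVC(C=C).fit(S, y)               # re-solve the SVM on the current vectors
            w, b = svm.coef_[0], svm.intercept_[0]
            margins = y * (S @ w + b)
            active = margins < 1.0                       # utterances contributing hinge loss
            # Subgradient of the hinge term w.r.t. A (the regularizer does not depend on A).
            grad_A = -C * np.outer(w, (y[active][:, None] * xbar[active]).sum(axis=0))
            A -= step * grad_A                           # small step along the antigradient
        return A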
|
#8 | Perceptual sensitivity to dialectal and generational variations in vowels
Robert Allen Fox (Department of Speech and Hearing Science, The Ohio State University) Ewa Jacewicz (Department of Speech and Hearing Science, The Ohio State University)
Perception of dialect variation is well studied with respect to perceptual similarity of talkers based on dialectal markers. This study examines the perceptual distinctiveness of regional vowel variants in light of cross-generational changes in vowel productions. Listeners from two regional dialects of English identified the dialect of the speaker in monosyllabic words (produced by older adults, young adults and children). Differential listener sensitivity to speaker dialect was found, which was highly affected by speaker generation. This suggests that the ability to determine dialect membership is an interaction between the perceptual spaces of listeners and the acoustic variations in vowels.
|
#9 | Investigation of Cross-show Speaker Diarization
Qian Yang (Cognitive Systems Lab, Karlsruhe Institute of Technology,Karlsruhe,Germany) Tanja Schultz (Cognitive Systems Lab, Karlsruhe Institute of Technology,Karlsruhe,Germany) Qin Jin (Language Technologies Institute, Carnegie Mellon University,USA)
The goal of cross-show diarization is to index speech segments of speakers across a set of shows, with the particular challenge that speakers reappearing across shows have to be assigned the same speaker identity. In this paper, we introduce three cross-show diarization systems and present our initial experiments on the cross-show diarization task. Among the three systems, the Global-BIC-cluster system achieves the best performance, with 15.53% and 13.21% cross-show diarization error rate (DER) on the dev and test sets respectively. However, the incremental approach is considered more practical in real-life use. By applying T-Norm to the incremental system, we obtain 13.18% and 10.97% relative improvements in cross-show DER on the dev and test sets. We also investigate the impact of show order on cross-show DER for the incremental approach.
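T-Norm itself is a simple score normalisation; a minimal sketch is given below, assuming a set of cohort scores for the same test segment is already available (how the cohort is built in the incremental cross-show setting follows the paper and is not reproduced here).

    import numpy as np

    def t_norm(raw_score, cohort_scores):
        # Shift and scale a segment-vs-cluster score by the statistics of the scores
        # the same segment obtains against a cohort of other speaker models.
        cohort_scores = np.asarray(cohort_scores, dtype=float)
        return (raw_score - cohort_scores.mean()) / max(cohort_scores.std(), 1e-10)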
|
#10 | Language Identification for Text Chats
Vesa Siivola (Rosetta Stone) Bryan Pellom (Rosetta Stone) Meagan Sills (Rosetta Stone)
This work aims to classify the language of typed messages in a text chat system used by language learners. A method for training a language classifier from unlabeled data is presented. A dictionary-based method is used to produce an initial classification of the messages, and character-based n-gram models of order 3 and 5 are built. A method for selectively choosing the n-grams to be modeled is used to train 15-gram models. This method produces the best-performing classifier, which has models for 57 languages and obtains over 95% accuracy on the classification of messages that are unambiguously in one language.
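A generic character n-gram language classifier of roughly this kind can be sketched as follows; it uses plain low-order character n-grams with naive Bayes rather than the selectively trained 15-gram models of the paper, and the training texts are toy placeholders.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # One training snippet per language (toy data; a real system would use large corpora).
    train_texts = ["hello how are you today", "hola como estas hoy", "bonjour comment allez-vous"]
    train_langs = ["en", "es", "fr"]

    clf = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),   # character 1- to 3-grams
        MultinomialNB(alpha=0.1),
    )
    clf.fit(train_texts, train_langs)
    print(clf.predict(["comment vas-tu"]))                         # classify a new message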
|
#11 | Spoken Language Recognition in the Latent Topic Simplex
Kong Aik Lee (Institute for Infocomm Research, Singapore) Chang Huai You (Institute for Infocomm Research, Singapore) Ville Hautamäki (Institute for Infocomm Research, Singapore) Anthony Larcher (Institute for Infocomm Research, Singapore) Haizhou Li (Institute for Infocomm Research, Singapore)
This paper proposes the use of latent topic modeling for spoken language recognition, where a topic is defined as a discrete distribution over phone n-grams. The latent topics are trained in an unsupervised manner using the latent Dirichlet allocation (LDA) technique. Language recognition is then performed in a low-dimensional simplex defined by the latent topics. We apply the Bhattacharyya measure to compute n-gram similarity in the topic simplex. Our study shows that some of the latent topics are language-specific while others exhibit multilingual characteristics. Experiments conducted on the NIST 2007 language detection task show that language cues can be sufficiently preserved in the topic simplex.
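A minimal sketch of this pipeline, assuming phone n-gram counts per utterance are already available as a count matrix: fit LDA to obtain per-utterance topic proportions (points in the topic simplex), then compare utterances with the Bhattacharyya coefficient. The counts below are random stand-ins and the number of topics is arbitrary.

    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    rng = np.random.default_rng(0)
    counts = rng.integers(0, 5, size=(100, 500))            # (utterances, phone n-gram vocabulary)

    lda = LatentDirichletAllocation(n_components=20, random_state=0)
    theta = lda.fit_transform(counts)                        # rows: topic proportions on the simplex
    theta = theta / theta.sum(axis=1, keepdims=True)

    def bhattacharyya_coefficient(p, q):
        # Similarity between two points of the topic simplex (1.0 for identical distributions).
        return float(np.sum(np.sqrt(p * q)))

    similarity = bhattacharyya_coefficient(theta[0], theta[1])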
|