12th Annual Conference of the
International Speech Communication Association
Interspeech 2011, Florence
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Mon-Ses1-O1: Speaker Recognition - Modeling, Automatic Procedures, Analysis II
Time: Monday 10:00
Place: Auditorium - Pala Congressi
Type: Oral
Chair: Kornel Laskowski
10:00 | Data-driven Gaussian Component Selection for Fast GMM-Based Speaker Verification
Ce Zhang (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences); Rong Zheng (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences); Bo Xu (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)
In this paper, a fast likelihood calculation for Gaussian mixture models (GMMs) is presented, based on dividing the acoustic space into disjoint subsets and then assigning the most relevant Gaussians to each of them. A data-driven approach is explored to select Gaussian components, which guarantees that the loss introduced by pre-discarding the least useful Gaussians can be easily controlled by a manually set parameter. To avoid rapid growth of the index table size, a two-level index scheme is proposed. We adjust several sets of parameters to validate our approach, which is expected to speed up the computation while maintaining performance. Experiments on the female part of the telephone condition of NIST SRE 2006 indicate that the speed can be improved by up to 5 times over the GMM-UBM baseline system without performance loss.
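The idea of shortlisting Gaussians per region of acoustic space can be sketched as follows; the clustering of the space, the relevance criterion, and the shortlist size are illustrative assumptions rather than the authors' exact data-driven procedure or two-level index:
```python
# Hedged sketch: fast GMM scoring via per-region Gaussian shortlists
# (diagonal covariances; the region centroids could come from k-means).
import numpy as np

def build_index(means, covs, weights, centroids, top_k=8):
    """For each acoustic subregion (centroid), keep the top_k Gaussians
    scoring highest at that centroid -- a simple relevance proxy."""
    index = []
    for c in centroids:
        ll = (np.log(weights)
              - 0.5 * np.sum(np.log(covs), axis=1)
              - 0.5 * np.sum((c - means) ** 2 / covs, axis=1))
        index.append(np.argsort(ll)[-top_k:])
    return index

def fast_loglike(x, means, covs, weights, centroids, index):
    """Evaluate only the Gaussians shortlisted for the nearest subregion."""
    region = np.argmin(np.sum((centroids - x) ** 2, axis=1))
    sel = index[region]
    d = x.shape[0]
    ll = (np.log(weights[sel])
          - 0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(covs[sel]), axis=1))
          - 0.5 * np.sum((x - means[sel]) ** 2 / covs[sel], axis=1))
    return np.logaddexp.reduce(ll)
```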
10:20 | Analysis of i-vector Length Normalization in Speaker Recognition Systems
Daniel Garcia-Romero (Department of Electrical and Computer Engineering, University of Maryland, College Park, MD); Carol Y. Espy-Wilson (Department of Electrical and Computer Engineering, University of Maryland, College Park, MD)
We present a method to boost the performance of probabilistic generative models that work with i-vector representations. The proposed approach deals with the non-Gaussian behavior of i-vectors by performing a simple length normalization. This non-linear transformation allows the use of probabilistic models with Gaussian assumptions that yield equivalent performance to that of more complicated systems based on Heavy-Tailed assumptions. Significant performance improvements are demonstrated on the telephone portion of NIST SRE 2010.
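The normalization itself is a single operation on each i-vector; a minimal sketch (in practice it is typically applied after centering and whitening of the i-vectors):
```python
import numpy as np

def length_normalize(w, eps=1e-12):
    """Project an i-vector onto the unit sphere: w / ||w||."""
    return w / (np.linalg.norm(w) + eps)
```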
10:40 | An Analysis Framework based on Random Subspace Sampling for Speaker Verification
Weiwu Jiang (Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, Hong Kong S.A.R, China); Zhifeng Li (Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong S.A.R, China); Helen Meng (Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, Hong Kong S.A.R, China)
Using Joint Factor Analysis (JFA) supervectors for subspace analysis suffers from problems such as high processing complexity and overfitting. We propose an analysis framework based on random subspace sampling to address these problems. In this framework, JFA supervectors are first partitioned equally, and each partitioned subvector is projected onto a subspace by PCA. All projected subvectors are then concatenated, and PCA is applied again to reduce the dimension by projection onto a low-dimensional feature space. Finally, we randomly sample this feature space and build classifiers for the sampled features. The classifiers are fused to produce the final classification output. Experiments on the NIST SRE 2008 corpora demonstrate the effectiveness of the proposed framework.
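A rough sketch of the described pipeline (equal partitioning, per-block PCA, concatenation, global PCA, then random sampling of the resulting feature space); the block counts, dimensions, and the plain PCA routine are placeholder assumptions, and the per-subset classifiers and their fusion are omitted:
```python
import numpy as np

def pca_project(X, dim):
    """Simple PCA projection; rows of X are observations."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T

def random_subspace_features(supervectors, n_blocks=16, block_dim=50,
                             final_dim=400, n_subsets=10, subset_dim=200,
                             seed=0):
    """Partition supervectors into equal blocks, apply PCA per block,
    concatenate, apply PCA again, then draw random feature subsets on
    which individual classifiers would be trained and later fused."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(supervectors, n_blocks, axis=1)
    projected = [pca_project(b, block_dim) for b in blocks]
    low_dim = pca_project(np.hstack(projected), final_dim)
    subsets = [rng.choice(final_dim, size=subset_dim, replace=False)
               for _ in range(n_subsets)]
    return [low_dim[:, idx] for idx in subsets]
```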
11:00 | Factor analysis back ends for MLLR transforms in speaker recognition
Nicolas Scheffer (SRI International); Yun Lei (SRI International); Luciana Ferrer (SRI International)
The purpose of this work is to show how recent developments in cepstral-based systems for speaker recognition can be leveraged for the use of Maximum Likelihood Linear Regression (MLLR) transforms. Speaker recognition systems based on MLLR transforms have been shown to be greatly beneficial in combination with standard systems, but most of the advances in speaker modeling techniques have been implemented for cepstral features. We show how these advances based on factor analysis, such as eigenchannel and i-vector modeling, can be easily employed to achieve very high accuracy. We show that they outperform the current state-of-the-art MLLR-SVM system that SRI submitted to the NIST SRE 2010 evaluation. The advantages of leveraging the new approaches are manifold: the ability to process a large amount of data, working in a reduced-dimensional space, importing any advances made for cepstral systems to the MLLR features, and the potential for system combination at the i-vector level.
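The underlying recipe can be pictured as flattening the per-utterance MLLR transform coefficients into a single high-dimensional vector and then applying a factor-analysis style projection to obtain a compact representation for a standard back end; the flattening and the truncated-SVD projection below are illustrative assumptions, not SRI's implementation:
```python
import numpy as np

def mllr_feature(transforms):
    """Flatten per-regression-class MLLR transforms [A | b]
    (A is d x d, b is length d) into one feature vector."""
    return np.concatenate([np.hstack([A, b[:, None]]).ravel()
                           for A, b in transforms])

def low_rank_project(X, dim):
    """Illustrative low-rank projection of stacked MLLR feature vectors,
    standing in for an eigenchannel / i-vector style factor analysis."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return (X - mu) @ Vt[:dim].T
```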
11:20 | Report on Performance Results in the NIST 2010 Speaker Recognition Evaluation
Craig S. Greenberg (National Institute of Standards and Technology); Alvin F. Martin (National Institute of Standards and Technology); Bradford N. Barr (National Institute of Standards and Technology); George R. Doddington (Unaffiliated)
In the spring of 2010, the National Institute of Standards and Technology organized a Speaker Recognition Evaluation in which several factors believed to affect the performance of speaker recognition systems were explored. Among the factors considered in the evaluation were channel conditions, duration of training and test segments, number of training segments, and level of vocal effort. New cost function parameters emphasizing lower false alarm rates were used for two of the tests in the evaluation, and the reduction in false alarm rates exhibited by many of the systems suggests that the new measure may have helped to focus research on the low false alarm region of operation, which is important in many applications.
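For reference, the detection cost function behind both operating points has the form below; the parameter values in the comments are the commonly cited settings for the new SRE 2010 cost and its predecessor, and should be checked against the official evaluation plan:
```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.001):
    """NIST-style detection cost. The defaults reflect the new SRE 2010
    operating point (C_miss = C_fa = 1, P_target = 0.001), which weights
    false alarms far more heavily than the earlier setting
    (C_miss = 10, C_fa = 1, P_target = 0.01); values are assumptions."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
```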
11:40 | iVector Fusion of Prosodic and Cepstral Features for Speaker Verification
Marcel Kockmann (Brno University of Technology); Luciana Ferrer (SRI International); Lukas Burget (Brno University of Technology); Jan Cernocky (Brno University of Technology)
In this paper we apply the promising iVector extraction technique followed by PLDA modeling to simple prosodic contour features. With this procedure we achieve results comparable to a system that models much more complex prosodic features using our recently proposed SMM-based iVector modeling technique. We then propose a combination of both prosodic iVectors by joint PLDA modeling that leads to significant improvements over the individual systems, with an EER of 5.4% on NIST SRE 2008 telephone data. Finally, we combine these two prosodic iVector front ends with a baseline cepstral iVector system to achieve up to 21% relative reduction in the new DCF.
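A simplified sketch of the two combination levels mentioned: stacking the two prosodic iVectors so a single joint back-end model (such as PLDA) can be trained on them, and a linear score-level fusion with the cepstral subsystem; the length normalization and the fusion weight are illustrative assumptions, and in practice the weight would be trained on held-out data:
```python
import numpy as np

def joint_prosodic_ivector(iv_a, iv_b):
    """Stack two length-normalized prosodic iVectors so one joint
    back-end model (e.g. PLDA) can be trained on the combination."""
    norm = lambda w: w / np.linalg.norm(w)
    return np.concatenate([norm(iv_a), norm(iv_b)])

def fuse_scores(prosodic_score, cepstral_score, alpha=0.3):
    """Illustrative linear score-level fusion of the prosodic and
    cepstral subsystems (alpha is an assumed weight)."""
    return alpha * prosodic_score + (1.0 - alpha) * cepstral_score
```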