Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Wed-Ses2-O1:
Speaker Diarization II
Time: Wednesday 13:30
Place: Auditorium - Pala Congressi
Type: Oral
Chair: Hagai Aronowitz
13:30 | Prosodic and Phonetic Features for Speaker Clustering in Speaker Diarization Systems
Janez Zibert (Department of Information Sciences and Technology, University of Primorska, Koper, Slovenia) France Mihelic (Faculty of Electrical Engineering, University of Ljubljana, Ljubljana, Slovenia)
This paper focuses on speaker clustering methods used in speaker diarization systems. We concentrate on developing suitable representations of speaker segments for clustering and investigate different similarity measures for joining speaker segments. We implement two speaker clustering systems. The first is a standard approach using bottom-up agglomerative clustering with the BIC as the merging criterion. In the second system we developed a fusion-based speaker clustering approach, in which speaker segments are modeled by acoustic and prosodic representations. In this way we additionally model the prosodic characteristics of speakers and combine them with the basic acoustic information, which leads to improved clustering of the segments when speakers have similar acoustic properties or when acoustic conditions are poor.
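As an illustration of the BIC merging criterion mentioned in this abstract, the following is a minimal sketch, not the authors' implementation. It assumes each segment is modeled by a single diagonal-covariance Gaussian (a simplification chosen to keep the example dependency-free); the function names and the `penalty_weight` parameter are hypothetical.

```python
import math

def diag_log_det(frames):
    """Log-determinant of the diagonal ML covariance of a list of feature vectors."""
    n = len(frames)
    dim = len(frames[0])
    total = 0.0
    for k in range(dim):
        mean = sum(f[k] for f in frames) / n
        var = sum((f[k] - mean) ** 2 for f in frames) / n
        total += math.log(max(var, 1e-10))  # floor avoids log(0)
    return total

def delta_bic(seg_a, seg_b, penalty_weight=1.0):
    """Delta-BIC between modeling two segments separately and jointly.

    Sign convention assumed here: a negative value favours merging,
    a positive value favours keeping the segments apart.
    """
    n_a, n_b = len(seg_a), len(seg_b)
    n = n_a + n_b
    dim = len(seg_a[0])
    gain = 0.5 * (n * diag_log_det(seg_a + seg_b)
                  - n_a * diag_log_det(seg_a)
                  - n_b * diag_log_det(seg_b))
    # A diagonal Gaussian has 2 * dim free parameters (means and variances).
    penalty = 0.5 * penalty_weight * (2 * dim) * math.log(n)
    return gain - penalty

# Toy data: seg_y has the same statistics as seg_x, seg_z is far away.
seg_x = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
seg_y = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
seg_z = [[10.0, 11.0], [11.0, 10.0], [10.0, 10.0], [11.0, 11.0]]
```

In a bottom-up agglomerative scheme, the pair of clusters with the lowest delta-BIC would be merged repeatedly until no pair yields a negative value.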
|
13:50 | Diarization-based Speaker Retrieval for Broadcast Television Archives
Marijn Huijbregts (Radboud University Nijmegen, Centre for Language and Speech Technology) David van Leeuwen (Radboud University Nijmegen, Centre for Language and Speech Technology)
In this study we extend a query-by-example diarization-based speaker retrieval system to a full speaker retrieval system for broadcast television. The envisioned system is capable of finding all speakers in an archive using their names instead of example speech fragments. Information extracted from a television guide is used to label speaker clusters that most likely correspond to the found names. As part of the labeling process, all speaker clusters are first classified automatically based on their role in the programs they appear in. The role classification accuracy is 64% on our evaluation set. Speaker names can automatically be attributed to a fraction of the speaker clusters with an accuracy of 70%.
|
14:10 | The detection of overlapping speech with prosodic features for speaker diarization
Martin Zelenák (Universitat Politecnica de Catalunya) Javier Hernando (Universitat Politecnica de Catalunya)
Overlapping speech is responsible for a certain amount of the errors produced by standard speaker diarization systems in meeting environments. We investigate a set of prosody-based long-term features as a potential complement to our overlap detection system, which relies on short-term spectral parameters. The most relevant features are selected in a two-step process: they are first evaluated and ranked according to the mRMR criterion, and then the optimal number of features is determined by an iterative wrapper approach. We show that the addition of prosodic features decreases the overlap detection error. Detected overlap segments are used in speaker diarization to recover missed speech by assigning multiple speaker labels and to increase the purity of speaker clusters.
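The first selection step described above, ranking by mRMR (minimum redundancy, maximum relevance), can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' setup: features are assumed to be already discretized, mutual information is estimated from counts, the difference form of the criterion (relevance minus mean redundancy) is used, and the feature names in the toy data are invented. The second step, the wrapper search for the optimal number of features, is not shown.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def mrmr_rank(features, labels):
    """Greedy mRMR ordering of features given as {name: list of discrete values}."""
    relevance = {name: mutual_information(vals, labels)
                 for name, vals in features.items()}
    selected, remaining = [], list(features)
    while remaining:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): "pitch_copy" duplicates "pitch", so it is ranked last
# even though it is individually as relevant as "pitch".
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "pitch":      [0, 0, 0, 1, 1, 1, 1, 1],
    "pitch_copy": [0, 0, 0, 1, 1, 1, 1, 1],
    "energy":     [0, 1, 0, 1, 0, 1, 0, 1],
}
ranking = mrmr_rank(features, labels)
```

The toy example shows the point of the redundancy term: a duplicated feature adds no new information once its twin has been selected, so a less relevant but complementary feature is preferred.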
|
14:30 | LP Residual Features for Robust, Privacy-Sensitive Speaker Diarization
Sree Hari Krishnan Parthasarathi (Idiap Research Institute, EPFL) Herve Bourlard (Idiap Research Institute, EPFL) Daniel Gatica-Perez (Idiap Research Institute, EPFL)
We present a comprehensive study of the linear prediction residual for speaker diarization under single and multiple distant microphone conditions in privacy-sensitive settings, a requirement for analyzing a wide range of spontaneous conversations. Two representations of the residual are compared, namely real-cepstrum and MFCC, with the latter performing better. Experiments on RT06eval show that the residual with subband information from 2.5 kHz to 3.5 kHz and spectral slope yields a performance close to that of traditional MFCC features. As a way to objectively evaluate privacy in terms of linguistic information, we perform phoneme recognition. Residual features yield low phoneme accuracies compared to traditional MFCC features.
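For readers unfamiliar with the linear prediction (LP) residual, the sketch below shows one textbook way to obtain it: estimate predictor coefficients with the autocorrelation method and the Levinson-Durbin recursion, then subtract the prediction from the signal. It is an illustration only and makes no claim about the authors' front end; framing, windowing, the subband and spectral slope information, and the cepstral representations from the abstract are omitted, and all function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite signal (no windowing)."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] such that x[n] ~ sum_k a[k] * x[n-k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # signal is perfectly predictable or silent
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]

def lp_residual(x, order):
    """Return (coefficients, residual), where residual[n] = x[n] - prediction[n]."""
    coeffs = levinson_durbin(autocorrelation(x, order), order)
    residual = []
    for n in range(len(x)):
        pred = sum(coeffs[k] * x[n - 1 - k]
                   for k in range(order) if n - 1 - k >= 0)
        residual.append(x[n] - pred)
    return coeffs, residual

# Toy signal: a decaying exponential, which a first-order predictor models well.
signal = [0.9 ** n for n in range(200)]
coeffs, residual = lp_residual(signal, order=2)
signal_energy = sum(v * v for v in signal)
residual_energy = sum(v * v for v in residual)
```

The residual keeps what the short-term spectral envelope model cannot predict, which is why it carries less of the information needed to recover phonetic content.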
|
14:50 | Extending the Task of Diarization to Speaker Attribution
Houman Ghaemmaghami (Queensland University of Technology) David Dean (Queensland University of Technology) Robbie Vogt (Queensland University of Technology) Sridha Sridharan (Queensland University of Technology)
In this paper we extend the concept of speaker annotation within a single recording, or speaker diarization, to a collection-wide approach we call speaker attribution. Accordingly, speaker attribution is the task of clustering presumably homogeneous inter-session clusters obtained using diarization according to common cross-recording identities. The result of attribution is a collection of spoken audio across multiple recordings attributed to speaker identities. In this paper, an attribution system is proposed using mean-only MAP adaptation of a combined-gender UBM to model clusters from a perfect diarization system, as well as a JFA-based system with session variability compensation. The normalized cross-likelihood ratio is calculated for each pair of clusters to construct an attribution matrix, and the complete linkage algorithm is employed to cluster the inter-session clusters. A matched cluster purity and coverage of 87.1% was obtained on the NIST 2008 SRE corpus.
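The clustering step named in this abstract, complete linkage over a pairwise matrix, can be sketched as below. This is a generic illustration, not the authors' system: the matrix values are made-up placeholders standing in for pairwise dissimilarities (the abstract uses the normalized cross-likelihood ratio), and the stopping `threshold` is an assumed parameter.

```python
def complete_linkage(dist, threshold):
    """Agglomerative clustering with complete linkage.

    dist is a symmetric matrix of pairwise dissimilarities. Two clusters are
    merged only while the largest dissimilarity between any of their members
    does not exceed threshold.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Placeholder dissimilarities for four per-recording clusters:
# 0 and 1 are close, 2 and 3 are close, everything else is far apart.
attribution_matrix = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
speakers = complete_linkage(attribution_matrix, threshold=0.5)
```

Complete linkage is conservative: a single dissimilar pair is enough to block a merge, which tends to protect cluster purity at the cost of coverage.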
|
15:10 | Comparing Multi-Stage Approaches for Cross-Show Speaker Diarization
Viet-Anh Tran (LIMSI-CNRS) Viet Bac Le (Vocapia Research) Claude Barras (LIMSI-CNRS) Lori Lamel (LIMSI-CNRS)
Acoustic speaker diarization is investigated for situations where a collection of shows from the same source needs to be processed. In this case, the same speaker should receive the same label across all shows. We compare different architectures for cross-show speaker diarization: the obvious concatenation of all shows, a hybrid system combining a local first clustering stage with a second global stage, and an incremental system which processes the shows in a predefined order and updates the speaker models accordingly, the latter being best suited to real application scenarios. These three strategies were compared to a baseline system on a set of 46 ten-minute samples of British English scientific podcasts.
|