12th Annual Conference of the
International Speech Communication Association
Interspeech 2011 Florence
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Mon-Ses1-O3: Acoustic Event Detection
Time: Monday 10:00
Place: Brunelleschi (Green Room) - Pala Congressi - 2nd Floor
Type: Oral
Chair: Dirk van Compernolle
10:00 | Learning new acoustic events in an HMM-based system using MAP adaptation
Jürgen Thomas Geiger, Mohamed Anouar Lakhal, Björn Schuller, Gerhard Rigoll (Institute for Human-Machine Communication, Technische Universität München, 80290 Munich, Germany)
In this paper, we present a system for the recognition of acoustic events suited to a robotic application. HMMs are used to model the different acoustic event classes. We focus especially on the open-set case, in which a class of acoustic events occurs that was not included in the training phase, and evaluate how such newly occurring classes can be learnt using MAP adaptation or conventional training methods. A small database of acoustic events was recorded with a robotic platform to perform the experiments.
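As a rough illustration of the adaptation step described in this abstract, the following Python sketch applies the classic relevance-factor MAP adaptation to the means of a single diagonal-covariance GMM (e.g. one HMM state). The function name, the relevance factor tau and the diagonal-covariance assumption are illustrative choices, not details taken from the paper.

# Hedged sketch of MAP mean adaptation for one GMM; not the authors' code.
import numpy as np

def map_adapt_means(means, weights, covars, frames, tau=16.0):
    """means: (K, D) prior means; weights: (K,); covars: (K, D) diagonal.
    frames: (N, D) feature vectors of the newly observed acoustic event.
    tau: relevance factor controlling how strongly the prior is trusted."""
    K, D = means.shape
    # Component posteriors (responsibilities) for every frame.
    log_lik = np.stack([
        -0.5 * np.sum((frames - means[k]) ** 2 / covars[k]
                      + np.log(2 * np.pi * covars[k]), axis=1)
        for k in range(K)], axis=1) + np.log(weights)
    log_lik -= log_lik.max(axis=1, keepdims=True)
    gamma = np.exp(log_lik)
    gamma /= gamma.sum(axis=1, keepdims=True)                  # (N, K)

    n_k = gamma.sum(axis=0)                                    # soft counts
    ex_k = gamma.T @ frames / np.maximum(n_k[:, None], 1e-10)  # data means
    alpha = (n_k / (n_k + tau))[:, None]                       # adaptation weight
    return alpha * ex_k + (1.0 - alpha) * means                # interpolated means

Components with many adapted frames (large soft counts) move towards the new data, while rarely observed components stay close to the prior model.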
10:20 | Alternative Frequency Scale Cepstral Coefficient for Robust Sound Event Recognition
Yiren Leng (Institute for Infocomm Research, A*STAR, Singapore), Huy Dat Tran (Institute for Infocomm Research, A*STAR, Singapore), Norihide Kitaoka (Nagoya University, Japan), Haizhou Li (Institute for Infocomm Research, A*STAR, Singapore)
There are two issues when applying MFCCs to sound event recognition: 1) sound events have a broader spectral range than speech, so the log-frequency scale is less informative; 2) low-frequency noise is more prevalent, so the log-frequency scale captures more noise. To address these issues, we study two alternative frequency scales and show that, using SVMs and without the need for complex algorithms, they outperform MFCCs for sound event recognition under mismatched conditions.
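To make the idea of an alternative frequency scale concrete, the sketch below computes cepstral coefficients from a triangular filterbank placed on a linear frequency axis instead of the mel scale. This is one plausible alternative warping for illustration only; the paper's actual scales are not reproduced here.

# Illustrative sketch: cepstra from a linearly spaced filterbank.
import numpy as np
from scipy.fftpack import dct

def linear_scale_cepstra(power_spec, sr, n_filters=26, n_ceps=13):
    """power_spec: (n_frames, n_fft//2 + 1) power spectrogram; sr: sample rate."""
    n_bins = power_spec.shape[1]
    freqs = np.linspace(0, sr / 2, n_bins)
    centers = np.linspace(0, sr / 2, n_filters + 2)    # linear, not mel, spacing
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, c, hi = centers[i], centers[i + 1], centers[i + 2]
        # Triangular filter: rising slope up to c, falling slope down to hi.
        fbank[i] = np.clip(np.minimum((freqs - lo) / (c - lo),
                                      (hi - freqs) / (hi - c)), 0, None)
    log_energy = np.log(power_spec @ fbank.T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]

Compared with the mel scale, the linear spacing gives equal resolution to high-frequency bands, which matters for broadband sound events.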
|
10:40 | Evaluation of Abnormal Sound Detection using Multi-stage GMM in Various Environments
Akinori Ito (Graduate School of Engineering, Tohoku University), Akihito Aiba (Graduate School of Engineering, Tohoku University), Masashi Ito (Tohoku Institute of Technology), Shozo Makino (Tohoku Bunka Gakuen University)
We have been developing a method to automatically detect incidents by detecting abnormal sound events in audio signals recorded in real environments. The proposed method uses a multi-stage Gaussian mixture model (GMM) that learns rare sounds using multiple GMMs. In this work, we investigated the relationship between the sound environment and detection performance, and found that performance deteriorates in noisy environments and depends largely on the signal-to-noise ratio of the abnormal sounds. Next, we investigated methods for determining the hyperparameters of the multi-stage GMM, namely the intermediate thresholds, the numbers of GMM mixtures and the detection threshold. The experimental results showed that the combination of percentile-based threshold determination and Bayesian information criterion (BIC) based mixture determination was the most effective; however, with the automatically determined parameters, the detection performance deteriorated by up to 20%.
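The two hyperparameter choices named above can be sketched with scikit-learn as follows; this is a hedged illustration of BIC-based mixture selection and a percentile-based likelihood threshold, and does not reproduce the paper's multi-stage cascade itself.

# Hedged sketch of BIC mixture selection and percentile thresholding.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_bic(frames, candidate_k=(1, 2, 4, 8, 16, 32)):
    """Pick the number of mixtures that minimises the BIC on the data."""
    models = [GaussianMixture(n_components=k, covariance_type='diag',
                              random_state=0).fit(frames)
              for k in candidate_k]
    return min(models, key=lambda m: m.bic(frames))

def percentile_threshold(gmm, normal_frames, pct=1.0):
    """Threshold at the pct-th percentile of log-likelihoods of normal data."""
    return np.percentile(gmm.score_samples(normal_frames), pct)

# Usage sketch: frames scoring below the threshold become abnormal candidates.
# gmm = fit_gmm_bic(train_frames)
# thr = percentile_threshold(gmm, train_frames)
# abnormal = gmm.score_samples(test_frames) < thr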
11:00 | Unsupervised learning of acoustic events using dynamic time warping and hierarchical K-means++ clustering
Joerg Schmalenstroeer, Markus Bartek, Reinhold Haeb-Umbach (Department of Communications Engineering, University of Paderborn, Germany)
In this paper we propose to jointly consider Segmental Dynamic Time Warping (SDTW) and distance clustering for the unsupervised learning of acoustic events. As a result, the computational complexity increases only linearly with the database size, compared to a quadratic increase in a sequential setup where all pairwise SDTW distances between segments are computed prior to clustering. Further, we discuss options for seed value selection for clustering and show that drawing seeds with a probability proportional to the distance from the already drawn seeds, known as K-means++ clustering, results in a significantly higher probability of finding representatives of each of the underlying classes, compared to the commonly used draws from a uniform distribution. Experiments are performed on an acoustic event classification task and an isolated digit recognition task, where on the latter the final word accuracy approaches that of supervised training.
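The seeding strategy described in this abstract can be sketched in a few lines of Python. Note that the standard K-means++ formulation weights candidates by squared distance to the nearest already-chosen seed, which is what is implemented here; the segment representation X is an assumption for illustration.

# Minimal sketch of K-means++ seed selection (D^2 weighting).
import numpy as np

def kmeanspp_seeds(X, k, rng=np.random.default_rng(0)):
    """X: (N, D) segment representations; k: number of seeds to draw."""
    seeds = [X[rng.integers(len(X))]]                  # first seed: uniform draw
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen seed.
        d2 = np.min([np.sum((X - s) ** 2, axis=1) for s in seeds], axis=0)
        probs = d2 / d2.sum()                          # far points more likely
        seeds.append(X[rng.choice(len(X), p=probs)])
    return np.stack(seeds)

Because distant points are favoured, the chance that every underlying class contributes at least one seed is much higher than with uniform draws.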
11:20 | Feature Extraction Assessment for an Acoustic-Event Classification Task using the Entropy Triangle
David Mejía-Navarrete, Ascensión Gallardo-Antolín, Carmen Peláez-Moreno, Francisco J. Valverde-Albacete (Universidad Carlos III de Madrid)
We assess the behaviour of five different feature extraction methods for an acoustic event classification task, all built on the same underlying SVM technology, by means of two different techniques: accuracy and the entropy triangle. The entropy triangle is able to identify a classifier instance whose relatively high accuracy stems from specializing in some classes to the detriment of overall behaviour. In all other cases, with fair classifiers, accuracy and the entropy triangle agree.
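For readers unfamiliar with the entropy triangle, the sketch below computes its barycentric coordinates from a confusion matrix, assuming the balance equation H(Ux) + H(Uy) = DeltaH + 2*I(X;Y) + VI(X,Y); this is our reconstruction for illustration, not code from the paper.

# Hedged sketch: entropy-triangle coordinates of a confusion matrix.
import numpy as np

def entropy_triangle_coords(confusion):
    """confusion: rows = true class X, columns = predicted class Y."""
    p = np.asarray(confusion, dtype=float)
    p /= p.sum()                                       # joint P(X, Y)
    px, py = p.sum(axis=1), p.sum(axis=0)
    def H(q):
        q = q[q > 0]
        return -np.sum(q * np.log2(q))
    hx, hy, hxy = H(px), H(py), H(p.ravel())
    mi = hx + hy - hxy                                 # mutual information
    vi = 2 * hxy - hx - hy                             # H(X|Y) + H(Y|X)
    u = np.log2(p.shape[0]) + np.log2(p.shape[1])      # uniform-marginal entropy
    delta = u - hx - hy                                # divergence from uniformity
    return np.array([delta, 2 * mi, vi]) / u           # coordinates sum to 1

A classifier that inflates accuracy by specializing in a few classes shows up with a large delta (skewed marginals) despite modest mutual information.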
11:40 | Unsupervised Audio Analysis for Categorizing Heterogeneous Consumer Domain Videos
Pradeep Natarajan, Stavros Tsakalidis, Vasant Manohar, Rohit Prasad, Prem Natarajan (Raytheon BBN Technologies)
The ever-increasing volume of consumer-domain videos on the Internet has led to a surge of interest in automatically analyzing such content. The audio signal in these videos contains salient information, but applying current automatic speech recognition (ASR) techniques is not viable due to high variability, noise and multilingual content. We present two unsupervised techniques that do not rely on ASR to address these challenges. The first method involves learning an unsupervised codebook by clustering audio features, and the second involves directly matching low-level features using the pyramid match kernel (PMK). Experimental results on an approximately 200-hour audio corpus downloaded from YouTube show that both our approaches significantly outperform the traditional approach of first segmenting the audio stream into a set of mid-level classes (e.g. speech, non-speech, music, silence) and using the duration statistics of these classes to train high-level classifiers.
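The first (codebook) approach is essentially a bag-of-audio-words representation, which can be sketched as below; the feature choice, codebook size and function names are assumptions for illustration, not the paper's exact setup.

# Illustrative sketch: audio codebook via k-means and per-video histograms.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_frames, n_words=1024):
    """all_frames: (N, D) low-level features pooled over the training videos."""
    return KMeans(n_clusters=n_words, random_state=0, n_init=4).fit(all_frames)

def video_histogram(codebook, video_frames):
    """Map one video's frames to a normalised bag-of-audio-words vector."""
    words = codebook.predict(video_frames)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# The resulting histograms can then train any high-level classifier (e.g. an SVM),
# with no ASR or mid-level speech/music segmentation in the pipeline.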