12th Annual Conference of the
International Speech Communication Association
Interspeech 2011 Florence
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Sun-Ses2-P2: Speech Enhancement
Time: Sunday 13:30
Place: Valfonda 2 - Pala Congressi (Passi Perduti-Gallery)
Type: Poster
Chair: Dietrich Klakow
#1 | Evaluating artificial bandwidth extension by conversational tests in car using mobile devices with integrated hands-free functionality
Laura Laaksonen (Nokia, Symbian Smartphones, Audio Technology, Finland) Ville Myllylä (Nokia, Symbian Smartphones, Audio Technology, Finland) Riitta Niemistö (Nokia, Symbian Smartphones, Audio Technology, Finland)
This paper describes an artificial bandwidth extension (ABE) method that adds new high-frequency components to a narrowband signal by folding appropriately weighted subbands to frequencies from 4 kHz to 7 kHz, improving the quality and intelligibility of narrowband speech in mobile devices. The proposed algorithm was evaluated by subjective listening tests. In addition, a rarely used conversational test was conducted: the speech quality of 1) a narrowband phone call, 2) a wideband phone call, and 3) a narrowband phone call enhanced with ABE was evaluated in a conversational context using mobile devices with integrated hands-free (IHF) functionality. The results indicate that in the IHF use case, ABE quality exceeds narrowband speech quality both in car noise and in a quiet environment.
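As a rough, purely illustrative sketch of the spectral-folding idea (not the authors' implementation; the band edges and the fixed gain below are placeholder values), content from the upper narrowband range can be mirrored into the 4-7 kHz extension band as follows:

import numpy as np

def fold_highband(frame, fs=16000, gain=0.3):
    """Toy artificial bandwidth extension: mirror the 1-4 kHz content of an
    already upsampled narrowband frame into the 4-7 kHz band with a fixed
    gain.  Gain and band edges are illustrative placeholders only."""
    spec = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    src = np.where((freqs >= 1000) & (freqs < 4000))[0]   # source subband
    dst = np.where((freqs >= 4000) & (freqs < 7000))[0]   # extension band
    n = min(src.size, dst.size)
    ext = spec.copy()
    # fold (mirror) the top of the narrowband spectrum into the extension band
    ext[dst[:n]] = gain * spec[src[::-1][:n]]
    return np.fft.irfft(ext, n=len(frame))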
#2 | Low-Frequency Bandwidth Extension of Telephone Speech Using Sinusoidal Synthesis and Gaussian Mixture Model
Hannu Pulakka (Department of Signal Processing and Acoustics, Aalto University, Finland) Ulpu Remes (Adaptive Informatics Research Centre, Aalto University, Finland) Santeri Yrttiaho (Department of Signal Processing and Acoustics, Aalto University, Finland) Kalle Palomäki (Adaptive Informatics Research Centre, Aalto University, Finland) Mikko Kurimo (Adaptive Informatics Research Centre, Aalto University, Finland) Paavo Alku (Department of Signal Processing and Acoustics, Aalto University, Finland)
The limited audio bandwidth of narrowband telephone speech degrades speech quality. This paper proposes a method that extends the bandwidth of telephone speech to the frequency range 0-300 Hz. The lowest harmonics of voiced speech are generated using sinusoidal synthesis, and the energy in the extension band is estimated from spectral features using a Gaussian mixture model. The amplitudes and phases of the synthesized signal are adjusted based on those of the narrowband input speech. The proposed method was evaluated in listening tests together with a bandwidth extension method for the 4-8 kHz range. The low-frequency bandwidth extension was found to reduce dissimilarity with wideband speech, but no perceived quality improvement was achieved.
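A minimal sketch of the sinusoidal-synthesis step, assuming the fundamental frequency and the extension-band energy (in the paper, the latter comes from a GMM mapping of spectral features) are supplied as inputs:

import numpy as np

def synthesize_low_harmonics(f0, energy, n_samples, fs=8000, fmax=300.0, phase0=0.0):
    """Generate the harmonics of f0 that fall below fmax (e.g. 300 Hz) and
    scale them to a target energy.  In the paper the energy comes from a GMM
    mapping of narrowband features; here it is simply an argument."""
    t = np.arange(n_samples) / fs
    harmonics = np.arange(f0, fmax, f0)            # harmonic frequencies below fmax
    if harmonics.size == 0:
        return np.zeros(n_samples)
    sig = np.sum([np.cos(2 * np.pi * f * t + phase0) for f in harmonics], axis=0)
    sig *= np.sqrt(energy / np.sum(sig ** 2))      # impose the estimated energy
    return sig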
#3 | Memory-Based Approximation of the Gaussian Mixture Model Framework for Bandwidth Extension of Narrowband Speech
Amr Nour-Eldin (McGill University, Montreal, Canada) Peter Kabal (McGill University, Montreal, Canada)
In this paper, we extend our previous work on exploiting the temporal properties of speech to improve bandwidth extension (BWE) of narrowband speech using Gaussian mixture models (GMMs). By quantifying temporal properties through information-theoretic measures and using delta features, we have shown that narrowband memory significantly increases certainty about highband parameters. However, as delta features are non-invertible, they cannot be used directly to reconstruct highband frequency content. In the work presented here, we embed temporal properties indirectly into the GMM structure through a memory-dependent tree-based approach that extends the representation of the narrowband. In particular, sequences of past frames are progressively used to grow the GMM in a tree-like fashion. This growth approach yields reliable estimates of the GMM parameters, so that maximum-likelihood estimation is no longer necessary, circumventing the complexity of high-dimensional GMM training.
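A minimal sketch of how memory can be embedded in the narrowband representation, simply by stacking past frames onto the current feature vector before GMM modelling; the paper's tree-structured growth of the GMM itself is not reproduced here:

import numpy as np

def stack_memory(features, depth=2):
    """Append the previous `depth` narrowband feature frames to each current
    frame, so that a GMM trained on the stacked vectors models the joint
    distribution with temporal memory.  features: (n_frames, dim)."""
    n, d = features.shape
    padded = np.vstack([np.repeat(features[:1], depth, axis=0), features])
    return np.hstack([padded[depth - k : depth - k + n] for k in range(depth + 1)])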
#4 | Speech enhancement by reconstruction from cleaned acoustic features
Philip Harding (University of East Anglia) Ben Milner (University of East Anglia)
This paper proposes a novel method of speech enhancement that moves away from conventional filtering-based methods and instead aims to reconstruct clean speech from a set of speech features. Underlying the enhancement system is a speech model, which at present is a sinusoidal model. It is driven by a set of speech features, comprising voicing, fundamental frequency and spectral envelope, that are extracted from the noisy speech. A maximum a posteriori approach is proposed for estimating clean spectral envelope features from the noisy spectral envelope. A set of subjective tests, measuring speech quality, noise intrusiveness and overall quality, found the proposed method to be highly effective at removing noise. A comparison against conventional speech enhancement methods found its performance to be equivalent to that of Wiener filtering.
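For illustration, if clean and noisy envelope features are modelled as jointly Gaussian (a single-Gaussian stand-in, not necessarily the prior used in the paper), the MAP estimate reduces to the standard conditional-mean formula:

import numpy as np

def map_clean_envelope(y, mu_x, mu_y, Sxy, Syy):
    """MAP estimate of the clean envelope x given the noisy envelope y under
    a joint Gaussian model:  x_hat = mu_x + Sxy Syy^{-1} (y - mu_y).
    All statistics are assumed to have been learned offline."""
    return mu_x + Sxy @ np.linalg.solve(Syy, y - mu_y)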
#5 | A Soft Decision-based Speech Enhancement using Acoustic Noise Classification
Jae-Hun Choi (Hanyang University) Sang-Kyun Kim (Hanyang University) Joon-Hyuk Chang (Hanyang University)
In this paper, we present a speech enhancement technique based on ambient noise classification using a Gaussian mixture model (GMM). The principal parameters of the statistical model-based speech enhancement algorithm, such as the weighting parameter of the decision-directed (DD) method and the long-term smoothing parameter of the noise estimation, are set to different values according to the classified noise context to ensure the best performance for each noise type. For real-time environment awareness, the noise classification is performed on a frame-by-frame basis using the GMM within a soft-decision framework. The speech absence probability (SAP) is used to detect speech-absence periods and to update the likelihood of the GMM.
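For reference, the decision-directed a priori SNR estimate whose weighting parameter is adapted per noise class has the textbook form below; the class-dependent alpha is simply passed in:

import numpy as np

def decision_directed_snr(noisy_power, noise_power, prev_clean_power, alpha):
    """Decision-directed a priori SNR estimate (Ephraim-Malah style).
    `alpha` is the weighting parameter selected per classified noise type;
    the other inputs are per-bin power spectra."""
    post_snr = noisy_power / np.maximum(noise_power, 1e-12)   # a posteriori SNR
    return (alpha * prev_clean_power / np.maximum(noise_power, 1e-12)
            + (1.0 - alpha) * np.maximum(post_snr - 1.0, 0.0))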
#6 | A Noise Estimation Method Based on Speech Presence Probability and Spectral Sparseness
Chao Li (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) Wenju Liu (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)
This paper addresses the problem of noise power spectrum estimation. Existing noise estimation methods cannot perform reliably when the noise level increases abruptly (e.g., a narrowband noise burst). To overcome this problem, we improve the time-recursive averaging algorithm based on the speech presence probability (SPP) by exploiting the sparseness of the speech spectrum. First, we use an SPP estimation method based on fixed priors to obtain low SPP estimates at time-frequency bins where speech is absent. A spectral sparseness measure is then proposed to adjust the SPP estimates. Experiments show that the proposed method updates the noise estimates faster than state-of-the-art approaches in both stationary and nonstationary noise.
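A compact sketch of SPP-controlled time-recursive noise averaging; the sparseness measure shown is a hypothetical stand-in for the paper's measure and only indicates how such a weight could be formed:

import numpy as np

def update_noise_psd(noise_psd, noisy_power, spp, alpha=0.85):
    """Time-recursive noise PSD update controlled by the speech presence
    probability: bins judged speech-absent (low SPP) track the noisy power
    quickly, speech-present bins keep the previous estimate."""
    alpha_eff = alpha + (1.0 - alpha) * spp          # per-bin smoothing factor
    return alpha_eff * noise_psd + (1.0 - alpha_eff) * noisy_power

def sparseness_weight(noisy_mag, eps=1e-12):
    """Hypothetical sparseness measure: ratio of the L1 norm to sqrt(K) times
    the L2 norm across frequency, lying in (0, 1].  A high value (flat,
    noise-like spectrum) can be used to lower the SPP estimates."""
    k = noisy_mag.size
    return np.sum(noisy_mag) / (np.sqrt(k) * np.linalg.norm(noisy_mag) + eps)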
#7 | Improved a posteriori Speech Presence Probability Estimation Based on Cepstro-Temporal Smoothing and Time-Frequency Correlation
Chao Li (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) Wenju Liu (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)
In this paper, we present a novel estimator for the speech presence probability (SPP) at each time-frequency point in the short-time Fourier transform (STFT) domain. Existing SPP estimators cannot perform reliably in nonstationary noise environments when applied to a speech enhancement task. To overcome this limitation, we propose a novel SPP estimation method. First, spectral outliers are eliminated by selectively smoothing the maximum-likelihood estimate of the a priori signal-to-noise ratio (SNR) in the cepstral domain. An adaptive tracking of the a priori SPP is then derived by exploiting the strong correlation of speech presence in neighboring frequency bins of consecutive frames. The proposed approach outperforms state-of-the-art approaches, resulting in less noise leakage and lower speech distortion in both stationary and nonstationary noise environments.
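A minimal sketch of the second step, an adaptive a priori SPP obtained by averaging the previous frame's SPP over neighboring frequency bins; the window size and the limits are placeholders, and the cepstral smoothing step is not shown:

import numpy as np

def adapt_prior_spp(prev_spp, radius=1, floor=0.1, ceil=0.9):
    """Adaptive a priori speech presence probability: average the posterior
    SPP of neighboring frequency bins from the previous frame, exploiting
    the correlation of speech presence across time and frequency.
    prev_spp: 1-D array of per-bin SPP values from the previous frame."""
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    smoothed = np.convolve(prev_spp, kernel, mode="same")
    return np.clip(smoothed, floor, ceil)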
#8 | A Rapid Adaptation Algorithm for Tracking Highly Non-Stationary Noises Based on Bayesian Inference for On-Line Spectral Change Point Detection
Md Foezur Rahman Chowdhury (INRS-EMT, Université du Québec, Montreal, QC, Canada) Sid-Ahmed Selouani (Université de Moncton, Campus de Shippagan, NB, Canada) Douglas O'Shaughnessy (INRS-EMT, Université du Québec, Montreal, QC, Canada)
This paper presents a rapid adaptation technique for tracking highly non-stationary acoustic noises. The novelty of this technique is that it can detect acoustic change points from the spectral characteristics of the observed speech signal in rapidly changing non-stationary acoustic environments. The proposed noise tracking technique is well suited to joint additive and channel distortion compensation (JAC) for on-line automatic speech recognition (ASR). A Bayesian on-line change point detection (BOCPD) approach is used to implement the technique. The proposed algorithm is tested on highly non-stationary noisy speech samples from the Aurora2 speech database. Compared with the widely used baseline noise tracking algorithm MCRA and its derivatives, a significant reduction in the adaptation delay to new acoustic conditions is obtained for highly non-stationary noises.
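For orientation, a textbook Bayesian on-line change point detection recursion (Adams and MacKay style) on a scalar feature stream, with a Gaussian observation model standing in for the paper's spectral model; the hazard rate and priors are illustrative only:

import numpy as np
from scipy.stats import t as student_t

def bocpd_run_length(x, hazard=0.01, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Bayesian on-line change point detection on a scalar stream x (e.g. a
    per-frame spectral statistic) using a Gaussian observation model with a
    Normal-Inverse-Gamma prior.  Returns the most probable run length at each
    frame; a drop back to 0 signals a change point."""
    mu, kappa = np.array([mu0]), np.array([kappa0])
    alpha, beta = np.array([alpha0]), np.array([beta0])
    R = np.array([1.0])                              # run-length posterior
    map_run = []
    for xt in x:
        scale = np.sqrt(beta * (kappa + 1.0) / (alpha * kappa))
        pred = student_t.pdf(xt, df=2.0 * alpha, loc=mu, scale=scale)
        growth = R * pred * (1.0 - hazard)           # the current run continues
        cp = np.sum(R * pred * hazard)               # a change point occurs
        R = np.append(cp, growth)
        R /= R.sum()
        map_run.append(int(np.argmax(R)))
        # posterior parameter updates; index 0 restarts from the prior
        mu_new = (kappa * mu + xt) / (kappa + 1.0)
        beta_new = beta + kappa * (xt - mu) ** 2 / (2.0 * (kappa + 1.0))
        mu = np.append(mu0, mu_new)
        kappa = np.append(kappa0, kappa + 1.0)
        alpha = np.append(alpha0, alpha + 0.5)
        beta = np.append(beta0, beta_new)
    return map_run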
#9 | Single channel speech enhancement using MMSE estimation of short-time modulation magnitude spectrum
Kuldip Paliwal (Signal Processing Laboratory, School of Engineering, Griffith University, Australia) Belinda Schwerin (Signal Processing Laboratory, School of Engineering, Griffith University, Australia) Kamil Wojcicki (Signal Processing Laboratory, School of Engineering, Griffith University, Australia)
In this paper we investigate the enhancement of speech by applying MMSE short-time spectral magnitude estimation in the modulation domain. For this purpose, the traditional analysis-modification-synthesis framework is extended to include modulation domain processing. We compensate the noisy modulation spectrum for additive noise distortion by applying the MMSE short-time spectral magnitude estimation algorithm in the modulation domain. Subjective experiments were conducted to compare the quality of stimuli processed by the MMSE modulation magnitude estimator to those processed using the MMSE acoustic magnitude estimator and the modulation spectral subtraction method. The proposed method is shown to have better noise suppression than MMSE acoustic magnitude estimation, and improved speech quality compared to modulation domain spectral subtraction.
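A simplified sketch of modulation-domain processing of the STFT magnitude trajectories; a Wiener-type gain stands in for the MMSE short-time spectral magnitude estimator used in the paper, and the modulation-domain noise power per acoustic bin is assumed to be given:

import numpy as np

def enhance_modulation_magnitude(noisy_mag, noise_mod_power, seg_len=32):
    """For each acoustic frequency bin, the trajectory of STFT magnitudes
    across frames is transformed with a second (modulation) FFT, a gain is
    applied, and the trajectory is rebuilt with the noisy modulation phase.
    noisy_mag: (n_frames, n_bins); noise_mod_power[k]: assumed modulation-domain
    noise power for bin k.  Non-overlapping modulation frames for brevity;
    trailing frames are left unmodified."""
    out = np.copy(noisy_mag)
    n_frames, n_bins = noisy_mag.shape
    for k in range(n_bins):
        for start in range(0, n_frames - seg_len + 1, seg_len):
            seg = noisy_mag[start:start + seg_len, k]
            mod = np.fft.rfft(seg)                                  # modulation spectrum
            snr = np.maximum(np.abs(mod) ** 2 / noise_mod_power[k] - 1.0, 1e-3)
            gain = snr / (snr + 1.0)                                # Wiener-type gain
            out[start:start + seg_len, k] = np.maximum(
                np.fft.irfft(gain * mod, n=seg_len), 0.0)
    return out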
#10 | Speech Enhancement Using Masking Properties in Adverse Environments
Atanu Saha (Graduate School of Science and Engineering, Saitama University, Saitama, Japan) Tetsuya Shimamura (Graduate School of Science and Engineering, Saitama University, Saitama, Japan)
In this paper, we propose a speech enhancement method that exploits the masking properties of the human auditory system. The masking properties are used to calculate a masking threshold; spectral components that lie above this threshold are audible to human listeners. In the proposed method, these audible spectral components are suppressed by a predefined attenuation factor of the original noise. The experimental results show that the proposed method performs significantly better than conventional approaches.
#11 | Phoneme-dependent NMF for speech enhancement in monaural mixtures
Bhiksha Raj (Carnegie Mellon University) Rita Singh (Carnegie Mellon University) Tuomas Virtanen (Tampere University of Technology)
The problem of separating speech signals from monaural mixtures (with other non-speech or speech signals) has become increasingly popular in recent times. Among the various solutions proposed, the most popular methods are based on compositional models such as non-negative matrix factorization (NMF) and latent variable models. Although these techniques are highly effective, they largely ignore the inherently phonetic nature of speech. In this paper we present a phoneme-dependent NMF-based algorithm to separate speech from monaural mixtures. Experiments performed on speech mixed with music indicate that the proposed algorithm yields a significant improvement in separation performance over conventional NMF-based separation.
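A small sketch of supervised NMF separation with pre-trained (fixed) speech and interference bases and a Wiener-style mask; the phoneme-dependent selection of bases described in the paper is not shown:

import numpy as np

def nmf_separate(V, W_speech, W_music, n_iter=100, eps=1e-9):
    """Separate a mixture magnitude spectrogram V (n_bins x n_frames) given
    fixed basis matrices for speech and interference.  Only the activations H
    are estimated, via multiplicative updates for the Euclidean cost."""
    W = np.hstack([W_speech, W_music])
    H = np.abs(np.random.rand(W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    ks = W_speech.shape[1]
    speech_part = W[:, :ks] @ H[:ks]
    mask = speech_part / (W @ H + eps)          # Wiener-style soft mask
    return mask * V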
#12 | Kernel PCA for Speech Enhancement
Christina Leitner (Graz University of Technology) Franz Pernkopf (Graz University of Technology) Gernot Kubin (Graz University of Technology)
In this paper, we apply kernel principal component analysis (kPCA), which has been used successfully for image de-noising, to speech enhancement. In contrast to other enhancement methods, which are based on the magnitude spectrum, we instead apply kPCA to complex spectral data; this is facilitated by Gaussian kernels. In the experiments, we show good noise reduction with few artifacts for noise-corrupted speech at different SNR levels using additive white Gaussian noise. We compare kPCA with linear PCA and spectral subtraction and evaluate all algorithms with perceptually motivated quality measures.
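A minimal sketch of kPCA de-noising with a Gaussian (RBF) kernel via scikit-learn's pre-image approximation; handling the complex spectral data by stacking real and imaginary parts is an assumption here, not necessarily the paper's representation:

import numpy as np
from sklearn.decomposition import KernelPCA

def kpca_denoise(noisy_patches, n_components=20, gamma=0.01):
    """noisy_patches: (n_patches, dim) real-valued features, e.g. real and
    imaginary parts of spectrogram patches stacked side by side.  Projects
    onto the leading kernel principal components and maps back to the input
    space (pre-image estimation)."""
    kpca = KernelPCA(n_components=n_components, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True, alpha=0.1)
    projected = kpca.fit_transform(noisy_patches)
    return kpca.inverse_transform(projected)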
#13 | Objective Intelligibility Prediction of Speech by Combining Correlation and Distortion based Techniques
Angel Gomez (Dpt. of Signal Theory, Networking and Communications, University of Granada, Spain) Belinda Schwerin (Signal Processing Laboratory, School of Engineering, Griffith University, Australia) Kuldip Paliwal (Signal Processing Laboratory, School of Engineering, Griffith University, Australia)
A number of techniques based on correlation measurements have recently been proposed to provide an objective measure of intelligibility. These techniques are able to detect nonlinear distortions and provide intelligibility scores highly correlated with those given by human listeners. However, their performance has not been found satisfactory for measuring the intelligibility of speech processed by enhancement algorithms. In this paper we first investigate the different correlation-based methods in the context of speech enhancement. We then propose to combine these correlation-based techniques with spectral distance-based ones. The results show that objective intelligibility prediction is significantly improved by this combination.
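An illustrative combination of a correlation-based score with a spectral-distance-based score into one intelligibility predictor; the envelope representation and the linear weighting are placeholders to be fitted on listening-test data:

import numpy as np

def combined_intelligibility(clean_env, proc_env, weight=0.5):
    """clean_env, proc_env: (n_frames, n_bands) temporal envelopes of the
    clean reference and the processed signal.  Combines a per-band
    correlation score with a negated, normalised spectral distance.
    The linear weighting is purely illustrative."""
    corr = np.mean([np.corrcoef(clean_env[:, b], proc_env[:, b])[0, 1]
                    for b in range(clean_env.shape[1])])
    dist = np.sqrt(np.mean((clean_env - proc_env) ** 2))
    dist_score = 1.0 / (1.0 + dist)                 # map distance to (0, 1]
    return weight * corr + (1.0 - weight) * dist_score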