Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Mon-Ses1-O5: Speech Enhancement Analysis and Evaluation

Time: Monday 10:00
Place: Raffaello - Pala Affari - 3rd Floor
Type: Oral

Chair: Doug O'Shaughnessy

10:00 Theoretical analysis of musical noise and speech distortion in structure-generalized parametric blind spatial subtraction array
Ryoichi Miyazaki (Nara Institute of Science and Technology)
Hiroshi Saruwatari (Nara Institute of Science and Technology)
Kiyohiro Shikano (Nara Institute of Science and Technology)
In this paper, we propose the structure-generalized parametric blind spatial subtraction array (BSSA) and analyze theoretically, via higher-order statistics, the amounts of musical noise and speech distortion it generates. We theoretically prove a tradeoff between the amount of musical noise and the amount of speech distortion in various types of BSSA. We also reveal that the best speech recognition performance is obtained when a lower exponent parameter is used in parametric BSSA.

10:20 Subjective and objective evaluation of speech intelligibility enhancement under constant energy and duration constraints
Yan Tang (Language and Speech Laboratory, Universidad del Pais Vasco)
Martin Cooke (Ikerbasque (Basque Science Foundation))
Speakers appear to adopt strategies to improve speech intelligibility for interlocutors in adverse acoustic conditions. Generated speech, whether synthetic, recorded or live, may also benefit from context-sensitive modifications in challenging situations. The current study measured the effect on intelligibility of six spectral and temporal modifications operating under global constraints of constant input-output energy and duration. Reallocation of energy from mid-frequency regions with high local SNR produced the largest intelligibility benefits, while other approaches such as pause insertion or maintenance of a constant segmental SNR actually led to a deterioration in intelligibility. Listener scores correlated only moderately well with recent objective intelligibility estimators, suggesting that further development of intelligibility models is required to improve predictions for modified speech.

10:40 A Risk-Estimation-Based Comparison of Mean Square Error and Itakura-Saito Distortion Measures for Speech Enhancement
Nagarjuna Reddy Muraka (Indian Institute of Science)
Chandra Sekhar Seelamantula (Indian Institute of Science)
The goal of speech enhancement algorithms is to provide an estimate of clean speech starting from noisy observations. In general, the estimate is obtained by minimizing a chosen distortion metric. The often-employed cost is the mean-square error (MSE), which results in a Wiener-filter solution. Since the ground truth is not available in practice, the practical utility of the optimal estimators is limited. Alternatively, one can optimize an unbiased estimate of the MSE. This is the key idea behind Stein's unbiased risk estimation (SURE) principle. Within this framework, we derive SURE solutions for the MSE and Itakura-Saito (IS) distortion measures. We also propose parametric versions of the corresponding SURE estimators, which give additional flexibility in controlling the attenuation characteristics for maximum signal-to-noise-ratio (SNR) gain. We compare the performance of the two distortion measures in terms of attenuation profiles, average segmental SNR, global SNR, and visual inspection of spectrograms. We also include a comparison with the standard power spectral subtraction technique. The results show that the SURE-IS approach consistently gives a better performance gain than SURE-MSE. The perceived sound quality is also better in the case of the SURE-IS estimator.

11:00 On Noise Tracking for Noise Floor Estimation
Mahdi Triki (Philips Research)
Various speech enhancement techniques (e.g. noise suppression, dereverberation) rely on knowledge of the statistics of the clean signal and the noise process. In practice, however, these statistics are not explicitly available, and the overall enhancement accuracy depends critically on how well the unknown statistics are estimated. In this respect, subspace-based approaches have been shown to allow for reduced estimation delay and to achieve a good tradeoff between tracking and final misadjustment. To track noise non-stationarity accurately, these schemes face the challenge of estimating the correlation matrix of the observed signal from a limited number of samples. In this paper, we investigate the effect of covariance estimation artifacts on noise PSD tracking. We show that the drawbacks of the estimation can be alleviated using an appropriate selection scheme.

11:20 Maximum a posteriori estimation of noise from non-acoustic reference signals in very low signal-to-noise ratio environments
Ben Milner (University of East Anglia)
This paper examines whether non-acoustic noise reference signals can provide accurate estimates of noise at very low signal-to-noise ratios (SNRs), where conventional estimation methods are less effective. The environment chosen for the investigation is Formula 1 motor racing, where SNRs are as low as -15 dB and the non-acoustic reference signals are engine speed, road speed and throttle measurements. Noise is found to relate closely to these reference signals, and a maximum a posteriori (MAP) method is proposed to estimate airflow and tyre noise from these parameters. Objective tests show MAP estimation to be more accurate than a range of conventional noise estimation methods. Subjective listening tests then compare speech enhancement using the proposed MAP estimation with conventional methods, with the former found to give significantly higher speech quality.

11:40 Blind speech prior estimation for generalized minimum mean-square error short-time spectral amplitude estimator
Ryo Wakisaka (Nara Institute of Science and Technology)
Hiroshi Saruwatari (Nara Institute of Science and Technology)
Kiyohiro Shikano (Nara Institute of Science and Technology)
Tomoya Takatani (Toyota Motor Corporation)
In this paper, to achieve high-quality speech enhancement, we introduce the generalized minimum mean-square error short-time spectral amplitude estimator with a new blind prior estimation of the speech probability density function (p.d.f.). To deal with various types of speech signals with different p.d.f.s, we propose a speech kurtosis estimation algorithm based on moment-cumulant transformation for blind adaptation of the shape parameter of the speech p.d.f. Objective and subjective evaluation experiments show the improved noise reduction performance of the proposed method.
