Two-channel mixtures of speech and real-world background noise
This task aims at evaluating source separation and denoising techniques in the context of speech enhancement by merging two datasets: the SiSEC 2010 noisy speech dataset and the CHiME corpus.



Description of the dataset
We consider two-channel mixtures of one speech source and real-world background noise sampled at 16 kHz. The dataset consists of two parts:
Public environments
These data are part of the SiSEC 2010 noisy speech dataset. Noise environments:
- Su1: subway car moving
- Su2: subway car standing at station
- Ca1: cafeteria 1
- Ca2: cafeteria 2
- Sq1: square 1
- Sq2: square 2
Recording conditions:
- Ce: center (except in Su)
- Co: corner
For Su and Ca, the speech signals were recorded in an office room using the same microphone pair. For the outdoor environment Sq, the speech signals were mixed anechoically. The distance between the sound source and the array centroid was 1.0 m for female speech and 0.8 m for male speech. The direction of arrival (DOA) of the speech source was different in each mixture and the signal-to-noise ratio (SNR) was drawn randomly between -17 and +12 dB.
Domestic environment
These data are part of the CHiME corpus.
- Li: living room
SNR conditions:
- -6 dB
- -3 dB
- 0 dB
- 3 dB
- 6 dB
- 9 dB

Test data
Download the test set

The data consist of 44 stereo WAV audio files that can be imported in Matlab using the wavread command. These files are named test_<env>_<cond>_<take>_mix.wav, where <env> is the noise environment, <cond> the recording condition (center/corner or SNR) and <take> a letter from A to D.
Entrants wishing to exploit the context of each sentence in the domestic environment database can also download the corresponding 5 min recordings.
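The test file naming scheme described above can be split into its fields with a short Python sketch. The helper name parse_name and the example filename are illustrative assumptions, and the way SNR conditions are spelled inside <cond> is not specified here, so the pattern accepts any text without underscores.

```python
import re
from typing import NamedTuple


class MixtureName(NamedTuple):
    subset: str  # "test" or "dev"
    env: str     # noise environment, e.g. "Ca1"
    cond: str    # recording condition (center/corner or SNR)
    take: str    # letter from A to D
    kind: str    # signal type, e.g. "mix"


# Assumption: fields never contain underscores themselves.
_PATTERN = re.compile(
    r"^(?P<subset>test|dev)_(?P<env>[^_]+)_(?P<cond>[^_]+)"
    r"_(?P<take>[A-D])_(?P<kind>[A-Za-z]+)\.(?:wav|txt)$"
)


def parse_name(filename: str) -> MixtureName:
    """Split a dataset filename into its fields (hypothetical helper)."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return MixtureName(**match.groupdict())


# Illustrative filename built from the scheme, not taken from the archive.
info = parse_name("test_Ca1_Ce_A_mix.wav")
print(info.env, info.cond, info.take)
```

Grouping files by env and cond in this way makes it easy to report results per environment and per recording condition.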

The data consist of 6 stereo WAV audio files that can be imported in Matlab using the wavread command. The text file embedding.txt describes the position of the selected sentences within these recordings. Each line provides the following five pieces of information:
- sentence filename (SiSEC)
- 5 min recording filename
- start position in samples within the 5 min recording
- duration in samples
- sentence filename (CHiME)
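The five fields listed above can be read and used to cut a sentence out of a 5 min recording as in the following sketch. Two points are assumptions, since the file format is not detailed here: fields are taken to be separated by whitespace, and the start position is taken to be 1-based (Matlab convention) unless one_based=False is passed. The helper names and the example line are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EmbeddingEntry:
    sisec_name: str      # sentence filename (SiSEC)
    recording_name: str  # 5 min recording filename
    start: int           # start position in samples
    duration: int        # duration in samples
    chime_name: str      # sentence filename (CHiME)


def parse_embedding_line(line: str) -> EmbeddingEntry:
    """Parse one line of embedding.txt (assumes whitespace-separated fields)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    sisec, recording, start, duration, chime = fields
    return EmbeddingEntry(sisec, recording, int(start), int(duration), chime)


def cut_sentence(samples: Sequence, entry: EmbeddingEntry, one_based: bool = True):
    """Return the samples of one sentence from a full recording."""
    # Assumption: a 1-based start index is converted to Python's 0-based one.
    first = entry.start - 1 if one_based else entry.start
    return samples[first:first + entry.duration]


# Made-up line for illustration only.
entry = parse_embedding_line("sentence.wav recording.wav 101 50 chime.wav")
segment = cut_sentence(list(range(1000)), entry)
print(len(segment))
```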
Development data
Download the development set

The data consist of 136 WAV audio files that can be imported in Matlab using the wavread command and 10 text files. These files are named as follows:
- dev_<env>_<cond>_<take>_src.wav: single-channel speech signal
- dev_<env>_<cond>_<take>_sim.wav: two-channel spatial image of the speech source
- dev_<env>_<cond>_<take>_noi.wav: two-channel spatial image of the background noise
- dev_<env>_<cond>_<take>_mix.wav: two-channel mixture signal
- dev_<env>_<cond>_<take>_DOA.txt: DOA of the speech source (see the SiSEC 2010 wiki for the convention adopted to measure DOA)
In the Su and Ca environments, the DOAs might contain a measurement error of up to a few degrees; in contrast, there is no such error in the Sq environment.
Entrants wishing to exploit the context of each sentence in the domestic environment database can also download the corresponding 5 min recordings.
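Since the development set provides the speech and noise spatial images separately, the SNR of a mixture can be checked from the sim and noi signals. The sketch below assumes that signals are already loaded as a list of channels, each a list of samples, and that the mixture is the sample-wise sum of the two images; that additive relation is the usual convention for spatial images but is not stated explicitly here. Function names are illustrative.

```python
import math
from typing import List, Sequence

Signal = Sequence[Sequence[float]]  # [channel][sample]


def energy(signal: Signal) -> float:
    """Total energy summed over all channels and samples."""
    return sum(x * x for channel in signal for x in channel)


def snr_db(speech_image: Signal, noise_image: Signal) -> float:
    """SNR in dB between the speech and noise spatial images.

    Raises ZeroDivisionError if the noise image is all zeros.
    """
    return 10.0 * math.log10(energy(speech_image) / energy(noise_image))


def mix(speech_image: Signal, noise_image: Signal) -> List[List[float]]:
    """Assumed mixing model: mixture = speech image + noise image."""
    return [
        [s + n for s, n in zip(speech_ch, noise_ch)]
        for speech_ch, noise_ch in zip(speech_image, noise_image)
    ]


# Toy two-channel signals, two samples each.
speech = [[1.0, 1.0], [1.0, 1.0]]
noise = [[0.1, 0.1], [0.1, 0.1]]
print(round(snr_db(speech, noise), 3))
```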

Training data
The CHiME database also includes training data

Tasks and reference software
We propose the following 3 tasks:
- speaker DOA estimation: estimate the DOA of the speech source (public environments only)
- speech signal estimation: estimate the single-channel dereverberated speech signal
- speech and noise spatial image estimation: decompose the mixture signal into two two-channel signals corresponding to the speech source and the background noise
Reference software will eventually be provided for each of these tasks.
Due to the specific construction of the dataset, at least four strategies may be employed to process the domestic environment mixtures:
- process each mixture (= 1 isolated sentence) alone
- process all mixtures with the same SNR (= 4 successive sentences without silence) together
- process the whole 5 min recording without knowledge of the sentence positions
- process the whole 5 min recording using knowledge of the sentence positions
Submission
Each participant is asked to submit the results of his/her algorithm for task 2 and/or 3 over all or part of the mixtures in the development dataset and the test dataset. The results for task 1 may also be submitted if possible.
In addition, each participant is asked to provide basic information about his/her algorithm (bibliographical reference, employed processing strategy, etc.) and to declare its average running time, expressed in seconds per test excerpt and per GHz of CPU.
Evaluation criteria
We propose to use the same evaluation criteria as in SiSEC 2010, except that the order of the estimated sources must be recovered.
The estimated speaker DOAs in task 1 will be evaluated in terms of absolute difference with the true DOAs.
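The task 1 criterion amounts to a per-mixture absolute difference, as in the sketch below. It assumes that estimated and true DOAs are expressed in degrees under the same convention and applies no angle wrap-around, since the DOA convention is defined elsewhere; the function name is illustrative.

```python
from typing import List, Sequence


def doa_absolute_errors(estimated: Sequence[float], true: Sequence[float]) -> List[float]:
    """Absolute difference between estimated and true DOAs, one per mixture."""
    if len(estimated) != len(true):
        raise ValueError("estimated and true DOA lists must have the same length")
    return [abs(e - t) for e, t in zip(estimated, true)]


# Made-up values for illustration.
errors = doa_absolute_errors([10.0, 95.0], [12.5, 90.0])
print(errors, sum(errors) / len(errors))
```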
The estimated speech signals in task 2 will be evaluated via the energy ratio criteria defined in the BSS_EVAL toolbox.


The estimated speech and noise spatial image signals in task 3 will be evaluated via the energy ratio criteria introduced for the Stereo Audio Source Separation Evaluation Campaign.
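As a rough illustration of an energy ratio criterion for task 3, the overall signal-to-distortion ratio (SDR) of an estimated spatial image can be computed as the ratio between the energy of the true image and the energy of the estimation error. This is a simplified sketch, not the reference implementation: the full criteria further split the error into several components, which is not reproduced here. Signals are assumed to be lists of channels of samples, and the function name is illustrative.

```python
import math
from typing import Sequence

Signal = Sequence[Sequence[float]]  # [channel][sample]


def image_sdr_db(true_image: Signal, estimated_image: Signal) -> float:
    """Overall SDR in dB: true image energy over total error energy.

    Raises ZeroDivisionError when the estimate is exactly the true image.
    """
    signal_energy = sum(t * t for ch in true_image for t in ch)
    error_energy = sum(
        (e - t) ** 2
        for true_ch, est_ch in zip(true_image, estimated_image)
        for t, e in zip(true_ch, est_ch)
    )
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy two-channel example.
true_img = [[1.0, 0.0], [0.0, 1.0]]
est_img = [[0.9, 0.0], [0.0, 1.1]]
print(round(image_sdr_db(true_img, est_img), 2))
```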


Performance will be compared to that of ideal binary masking as a benchmark (i.e. binary masks providing maximum SDR), computed over a STFT or a cochleagram.
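The ideal binary masking benchmark can be sketched as follows for the speech image. The inputs are assumed to be time-frequency coefficients that have already been computed (for example by an STFT), stored per channel as a flat list of complex values, one per time-frequency bin. Keeping a bin of the mixture leaves the noise as the error, while zeroing it leaves the speech as the error, so the mask keeps a bin when the speech energy summed over channels exceeds the noise energy. This maximizes SDR in the transform domain; whether it is exactly optimal after resynthesis depends on the transform. Function names are illustrative.

```python
from typing import List, Sequence

TF = Sequence[Sequence[complex]]  # [channel][time-frequency bin]


def ideal_binary_mask(speech_tf: TF, noise_tf: TF) -> List[bool]:
    """One boolean per bin: True where speech dominates noise across channels."""
    num_bins = len(speech_tf[0])
    mask = []
    for b in range(num_bins):
        speech_energy = sum(abs(ch[b]) ** 2 for ch in speech_tf)
        noise_energy = sum(abs(ch[b]) ** 2 for ch in noise_tf)
        mask.append(speech_energy > noise_energy)
    return mask


def apply_mask(mixture_tf: TF, mask: Sequence[bool]) -> List[List[complex]]:
    """Keep mixture coefficients where the mask is True, zero them elsewhere."""
    return [[c if keep else 0j for c, keep in zip(ch, mask)] for ch in mixture_tf]


# Toy example: two channels, two bins.
speech_tf = [[2 + 0j, 0.1j], [1 + 0j, 0j]]
noise_tf = [[0.5 + 0j, 1 + 0j], [0.5j, 1j]]
mixture_tf = [[s + n for s, n in zip(sc, nc)] for sc, nc in zip(speech_tf, noise_tf)]
mask = ideal_binary_mask(speech_tf, noise_tf)
speech_estimate_tf = apply_mask(mixture_tf, mask)
print(mask)
```

The noise image estimate is obtained with the complementary mask, so that the two estimates sum to the mixture.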
The above performance criteria and benchmarks are respectively implemented in
Licensing issues
All files are distributed under the terms of the Creative Commons Attribution-Noncommercial-ShareAlike 3.0 license.

Public environment data were authored by Ngoc Q. K. Duong and Nobutaka Ito.
Domestic environment data were authored by H. Christensen, Jon Barker, Ning Ma and Phil Green in the framework of the CHiME Project

- H. Christensen, J. Barker, N. Ma and P. Green. The CHiME corpus: a resource and a challenge for Computational Hearing in Multisource Environments. In Proc. Interspeech'10, 2010.
Potential Participants
- Shoko Araki (araki.shoko (a) lab_ntt_co_jp)
- Dorothea Kolossa (dorothea_kolossa (a) ruhr-uni-bochum_de)
- Alexey Ozerov (alexey_ozerov (a) inria_fr)
- Francesco Nesta (nesta (a) fbk_eu)
- Armin Sehr (sehr (a) nt_e-technik_uni-erlangen_de)
- Ngoc Duong
- Jani Even
- Hiroshi Saruwatari
- Dang Hai Tran Vu
- Hiroshi Sawada
Task proposed by the Audio Committee