Two-channel mixtures of speech and real-world background noise

This task aims at evaluating source separation and denoising techniques in the context of speech enhancement by merging two datasets: the SiSEC 2010 noisy speech dataset and the CHiME corpus. Both datasets consist of two-channel mixtures of one speech source and real-world background noise, so that algorithms applicable to one dataset are applicable to the other without additional effort. The source separation results obtained over the CHiME corpus will be analyzed in light of the speech recognition results obtained over that dataset as part of the CHiME Challenge.

Results

See the results over test and development data

Description of the dataset

We consider two-channel mixtures of one speech source and real-world background noise sampled at 16 kHz. The dataset consists of two parts:

Public environments

These data are part of the SiSEC 2010 noisy speech dataset. Background noise signals were recorded via a pair of omnidirectional microphones spaced 8.6 cm apart in six different public environments:

Su1: subway car moving
Su2: subway car standing at station
Ca1: cafeteria 1
Ca2: cafeteria 2
Sq1: square 1
Sq2: square 2

and in two different positions within each environment:

Ce: center (except in Su)
Co: corner

Several recordings identified by letters (A or B) were made in each case. Mixtures were then generated by adding a speech signal to the background noise signal. For the reverberant environments Su and Ca, the speech signals were recorded in an office room using the same microphone pair. For the outdoor environment Sq, the speech signals were mixed anechoically. The distance between the sound source and the array centroid was 1.0 m for female speech and 0.8 m for male speech. The direction of arrival (DOA) of the speech source was different in each mixture and the signal-to-noise ratio (SNR) was drawn randomly between -17 and +12 dB.
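The mixing step described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say whether the speech or the noise was rescaled to reach the target SNR, nor how the energies were measured, so the sketch rescales the noise and measures energy over all channels and samples. The function names (energy, noise_gain_for_snr, mix) are hypothetical.

```python
import math


def energy(signal):
    """Sum of squared samples over all channels (signal is a list of channels)."""
    return sum(x * x for channel in signal for x in channel)


def noise_gain_for_snr(speech, noise, snr_db):
    """Gain applied to the noise so that the speech-to-noise energy ratio is snr_db.

    Assumption: the noise is rescaled and the speech is left untouched.
    """
    target_ratio = 10.0 ** (snr_db / 10.0)
    return math.sqrt(energy(speech) / (energy(noise) * target_ratio))


def mix(speech, noise, snr_db):
    """Add the rescaled noise to the speech, channel by channel."""
    gain = noise_gain_for_snr(speech, noise, snr_db)
    return [
        [s + gain * n for s, n in zip(speech_channel, noise_channel)]
        for speech_channel, noise_channel in zip(speech, noise)
    ]
```

Both inputs are expected to be two-channel spatial images of equal length, so that the mixture is the sample-wise sum of the two images.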

Domestic environment

These data are part of the CHiME corpus. Background noise signals were recorded via a pair of in-ear microphones placed on a mannikin in a single domestic environment:

Li: living room

at six different SNRs:

-6 dB
-3 dB
0 dB
3 dB
6 dB
9 dB

Several recordings were made, each lasting 5 min. Mixtures were then generated by adding a number of speech sentences to the background noise signal. The speech signals were generated by convolving clean speech sentences with binaural room impulse responses (BRIRs) measured at a position 2 m directly in front of the mannikin. Hence the DOA of the speech source is known and fixed. From these data, we selected four successive sentences identified by letters (A to D) for each SNR. The silence between successive sentences was removed. Entrants can process each sentence separately, exploit the neighboring sentences, or exploit the whole 5 min context. For more details about this dataset, see the CHiME corpus paper cited under Licensing issues below.

Test data

Download the test set (13 MB)

The data consist of 44 stereo WAV audio files that can be imported in Matlab using the wavread command. These files are named test_<env>_<cond>_<take>_mix.wav, where <env> is the noise environment, <cond> the recording condition (center/corner or SNR) and <take> a letter from A to D.
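The naming scheme can be split into its fields with a short parser. The following is a minimal sketch, not part of the reference software: the names MixtureName and parse_name are hypothetical, and because the text does not say how SNR values are spelled inside file names, the <cond> field is accepted as any run of characters without an underscore.

```python
import re
from typing import NamedTuple, Optional


class MixtureName(NamedTuple):
    subset: str  # "test" or "dev"
    env: str     # noise environment, e.g. "Ca1"
    cond: str    # recording condition: position or SNR label
    take: str    # letter from A to D
    kind: str    # "mix", "src", "sim" or "noi"


# Assumption: <env> is alphanumeric and <cond> contains no underscore.
_NAME_PATTERN = re.compile(
    r"^(test|dev)_([A-Za-z0-9]+)_([^_]+)_([A-D])_(mix|src|sim|noi)\.wav$"
)


def parse_name(filename: str) -> Optional[MixtureName]:
    """Return the fields of a dataset file name, or None if it does not match."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        return None
    return MixtureName(*match.groups())
```

For instance, parse_name("dev_Ca1_Co_A_mix.wav") yields the fields dev, Ca1, Co, A and mix.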

Entrants wishing to exploit the context of each sentence in the domestic environment database can also download the corresponding 5 min recordings (87 MB).

The data consist of 6 stereo WAV audio files that can be imported in Matlab using the wavread command. The text file embedding.txt describes the position of the selected sentences within these recordings. Each line provides the following five pieces of information:

sentence filename (SiSEC)
5 min recording filename
start position in samples within the 5 min recording
duration in samples
sentence filename (CHiME)
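A line of embedding.txt could be read as sketched below. The field separator is not specified here, so whitespace-separated fields are assumed, and the names Embedding, parse_embedding_line and duration_seconds are hypothetical. Whether the start position counts from 0 or from 1 (the Matlab convention) is also not stated, so the sketch keeps the raw value and leaves that conversion to the caller.

```python
from typing import NamedTuple

SAMPLE_RATE_HZ = 16000  # 16 kHz, from the dataset description


class Embedding(NamedTuple):
    sisec_name: str      # sentence filename (SiSEC)
    recording_name: str  # 5 min recording filename
    start: int           # start position in samples (indexing base not specified)
    duration: int        # duration in samples
    chime_name: str      # sentence filename (CHiME)


def parse_embedding_line(line: str) -> Embedding:
    """Parse one line, assuming five whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    sisec_name, recording_name, start, duration, chime_name = fields
    return Embedding(sisec_name, recording_name, int(start), int(duration), chime_name)


def duration_seconds(entry: Embedding, fs: int = SAMPLE_RATE_HZ) -> float:
    """Sentence duration in seconds at the dataset sampling rate."""
    return entry.duration / fs
```

The start and duration fields are what an entrant would need to cut the estimate of an isolated sentence out of a processed 5 min recording.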

Development data

Download the development set (24 MB)

The data consist of 136 WAV audio files that can be imported in Matlab using the wavread command and 10 text files. These files are named as follows:

dev_<env>_<cond>_<take>_src.wav: single-channel speech signal
dev_<env>_<cond>_<take>_sim.wav: two-channel spatial image of the speech source
dev_<env>_<cond>_<take>_noi.wav: two-channel spatial image of the background noise
dev_<env>_<cond>_<take>_mix.wav: two-channel mixture signal
dev_<env>_<cond>_<take>_DOA.txt: DOA of the speech source (see the SiSEC 2010 wiki for the convention adopted to measure DOA)

Since the source DOAs were measured geometrically in the Su and Ca environments, they might contain a measurement error of up to a few degrees; by contrast, there is no such error in the Sq environment.

The mixtures dev_Ca1_Co_A_mix.wav and dev_Ca1_Co_B_mix.wav are identical (this is a mistake that will be corrected in future evaluations).

Entrants wishing to exploit the context of each sentence in the domestic environment database can also download the corresponding 5 min recordings (86 MB) (same nomenclature as above).

Training data

The CHiME database also includes training data for each speaker and for the background noise.

Tasks and reference software

We propose the following 3 tasks:

1. speaker DOA estimation: estimate the DOA of the speech source (public environments only)
2. speech signal estimation: estimate the single-channel dereverberated speech signal
3. speech and noise spatial image estimation: decompose the mixture signal into two two-channel signals corresponding to the speech source and the background noise

Participants are welcome to use some of the Matlab reference software below to build their own algorithms:

stft_multi.m: multichannel STFT
istft_multi.m: multichannel inverse STFT
example_denoising.m: TDOA estimation by GCC-PHATmax, ML target and noise variance estimation under a diffuse noise model, and multichannel Wiener filtering
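When searching for a TDOA on the public environment mixtures, the microphone spacing bounds the range of plausible lags. The sketch below is an illustration and not a port of example_denoising.m. It assumes a speed of sound of 343 m/s, a far-field source and a DOA measured from the microphone axis, which may differ from the SiSEC 2010 convention. It does not apply to the in-ear microphones of the domestic environment, where the head affects the delays.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value (dry air, about 20 degrees C)
MIC_SPACING_M = 0.086       # 8.6 cm, from the dataset description
SAMPLE_RATE_HZ = 16000      # 16 kHz, from the dataset description


def max_tdoa_seconds(spacing_m=MIC_SPACING_M, c=SPEED_OF_SOUND_M_S):
    """Largest possible inter-microphone delay, reached for an end-fire source."""
    return spacing_m / c


def max_lag_samples(spacing_m=MIC_SPACING_M, fs=SAMPLE_RATE_HZ, c=SPEED_OF_SOUND_M_S):
    """Smallest whole number of samples covering the largest possible delay."""
    return math.ceil(max_tdoa_seconds(spacing_m, c) * fs)


def tdoa_from_doa(doa_deg, spacing_m=MIC_SPACING_M, c=SPEED_OF_SOUND_M_S):
    """Far-field TDOA for a DOA measured from the microphone axis (assumed convention)."""
    return spacing_m * math.cos(math.radians(doa_deg)) / c
```

Under these assumptions the largest delay is about 0.25 ms, so max_lag_samples() returns 5 and a TDOA search needs sub-sample resolution to distinguish nearby directions.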

Due to the specific construction of the dataset, at least four strategies may be employed to process the domestic environment mixtures:

process each mixture (= 1 isolated sentence) alone
process all mixtures with the same SNR (= 4 successive sentences without silence) together
process the whole 5 min recording without knowledge of the sentence positions
process the whole 5 min recording using knowledge of the sentence positions

In any case, it is expected that the submitted signals correspond to the test mixtures (= isolated sentences).

Submission

Each participant is asked to submit the results of his/her algorithm for task 2 and/or 3 over all or part of the mixtures in the development dataset and the test dataset. The results for task 1 may also be submitted if possible.

Each participant should make his/her results available online in the form of a tarball with the following file naming convention:

test_<env>_<cond>_<take>_src.wav: single-channel speech signal
test_<env>_<cond>_<take>_sim.wav: two-channel spatial image of the speech source
test_<env>_<cond>_<take>_noi.wav: two-channel spatial image of the background noise
test_<env>_<cond>_<take>_DOA.txt: DOA of the speech source

For the domestic environment dataset, the CHiME file naming convention is also acceptable.
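The expected output names for a given test mixture can be derived mechanically from the mixture name. The sketch below is a hypothetical helper, not an official submission tool. The mapping from tasks to file suffixes is inferred from the task descriptions: task 1 gives the DOA text file, task 2 the single-channel speech signal, and task 3 the two spatial images.

```python
# Inferred mapping from task number to output file suffixes (an assumption).
TASK_SUFFIXES = {
    1: ["DOA.txt"],
    2: ["src.wav"],
    3: ["sim.wav", "noi.wav"],
}


def expected_outputs(mixture_name, tasks=(2, 3)):
    """List the result file names to submit for one mixture and a set of tasks."""
    suffix = "_mix.wav"
    if not mixture_name.endswith(suffix):
        raise ValueError(f"not a mixture file name: {mixture_name!r}")
    stem = mixture_name[: -len(suffix)]
    names = []
    for task in sorted(tasks):
        for task_suffix in TASK_SUFFIXES[task]:
            names.append(f"{stem}_{task_suffix}")
    return names
```

Running this over every test mixture gives a checklist against which the contents of a tarball can be compared before submission.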

Each participant should then send an email to "araki.shoko (at) lab.ntt.co.jp", "nesta (a) fbk.eu" and "emmanuel.vincent (at) inria.fr" providing:

contact information (name, affiliation)
basic information about his/her algorithm, including the employed processing strategy among the four strategies outlined above, its average running time (in seconds per test excerpt and per GHz of CPU) and a bibliographical reference if possible
the URL of the tarball

The submitted audio files will be made available on a website under the terms of the Licensing section below.

Evaluation criteria

We propose to use the same evaluation criteria as in SiSEC 2010, except that the order of the estimated sources must be recovered.

The estimated speaker DOAs in task 1 will be evaluated in terms of absolute difference with the true DOAs.
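The task 1 criterion amounts to the following computation. This is a minimal sketch with hypothetical function names. It assumes that estimated and true DOAs are expressed in the same unit and convention, and it applies no angle wrap-around, since the text only mentions an absolute difference.

```python
def doa_abs_error(estimated, true):
    """Absolute difference between an estimated and a true DOA (same unit)."""
    return abs(estimated - true)


def mean_doa_abs_error(estimates, references):
    """Average absolute DOA error over a set of mixtures."""
    if len(estimates) != len(references) or not estimates:
        raise ValueError("need two non-empty lists of equal length")
    errors = [doa_abs_error(e, t) for e, t in zip(estimates, references)]
    return sum(errors) / len(errors)
```

Averaging over mixtures is one possible summary and is not prescribed by the evaluation description.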

The estimated speech signals in task 2 will be evaluated via the energy ratio criteria defined in the BSS_EVAL toolbox allowing arbitrary filtering between the estimated source and the true source.

The estimated speech and noise spatial image signals in task 3 will be evaluated via the energy ratio criteria introduced for the Stereo Audio Source Separation Evaluation Campaign and via the perceptually-motivated criteria in the PEASS toolkit.

Performance will be compared to that of ideal binary masking as a benchmark (i.e. binary masks providing maximum SDR), computed over an STFT or a cochleagram.

The above performance criteria and benchmarks are respectively implemented in

An example use is given in example_denoising.m.

Licensing issues

All files are distributed under the terms of the Creative Commons Attribution-Noncommercial-ShareAlike 3.0 license. The files to be submitted by participants will be made available on a website under the terms of the same license.

Public environment data were authored by Ngoc Q. K. Duong and Nobutaka Ito.

Domestic environment data were authored by H. Christensen, Jon Barker, Ning Ma and Phil Green in the framework of the CHiME Project funded by the EPSRC. If you use these data in any published research please cite:

H. Christensen, J. Barker, N. Ma and P. Green. The CHiME corpus: a resource and a challenge for Computational Hearing in Multisource Environments. In Proc. Interspeech'10, 2010.

Potential Participants

Shoko Araki (araki.shoko (a) lab_ntt_co_jp)
Dorothea Kolossa (dorothea_kolossa (a) ruhr-uni-bochum_de)
Alexey Ozerov (alexey_ozerov (a) inria_fr)
Francesco Nesta (nesta (a) fbk_eu)
Armin Sehr (sehr (a) nt_e-technik_uni-erlangen_de)
Ngoc Duong
Jani Even
Hiroshi Saruwatari
Dang Hai Tran Vu
Hiroshi Sawada

Task proposed by the Audio Committee
