Determined convolutive mixtures under dynamic conditions
Under construction
Blind source separation (BSS) in real-world environments is a challenging task even for the simplest well-determined case, where the number of sources is known in advance and is less than or equal to the number of microphones. For this reason, the experimental evaluation of most of the algorithms proposed in the literature is conducted in controlled scenarios: the reverberation is not very high, the length of the mixtures is given, and the sources are observed for a relatively long time and do not change their locations. However, such conditions do not reflect a real-world scenario well: the reverberation cannot be neglected, and many sources can move in the environment or produce sounds from random locations (e.g. as in a meeting where multiple speakers are in different static locations). Furthermore, the source activity is unknown and different sources overlap at different time instants.
The aim of this task is to evaluate BSS algorithms able to handle all these uncertainties in a blind manner. In particular, we focus on dynamic mixing conditions and unknown source activity, defining two scenarios and their related datasets:
1) random source activity of multiple sources in multiple static locations;
2) a source continuously moving and overlapped with a source in a fixed or random location.
This task is derived from the task "Determined convolutive mixtures under dynamic conditions" in SiSEC2010. While only two channels were considered in SiSEC2010, these datasets provide four-channel mixtures. However, we still consider the case where at most two speakers are simultaneously active. Participants can decide whether to use all the available channels or only a subset of them.
Description of the datasets
The recordings were made in a real room of size 6 m x 5 m x 4 m with an estimated reverberation time of about T60 = 700 ms. For both datasets, the signals were recorded by a uniform linear array of four (directional) microphones with different spacings (about 2 cm, 8 cm and 18 cm) and sampled at Fs = 16 kHz.
Dataset 1 ("random source activity of multiple sources in multiple static locations")
This dataset emulates the conditions of a meeting where the activity and the locations of multiple speakers vary randomly over time. However, an active speaker does not change location. For the sake of simplicity, we simulated the case where at most two speakers are simultaneously active.
Sources are located on a semi-circle in front of the array (see figure 1) at a given distance “d” from the center of the array. We provide two different datasets, for d=1.0 m and d=2.0 m. Note that the latter is very difficult, due to the reduced direct-to-reverberant ratio (DRR) and the highly reverberant conditions.
Figure1: geometrical setup
Two different geometrical setups have been simulated:
a) Setup 1:
The competing sources are always located on different angular sides with respect to the center of the array, i.e. one in (-90°;0°) and the other in (0°;+90°) (see figure 2).
b) Setup 2:
The two competing sources can be located anywhere in the whole angular space (-90°;+90°), but never in the same angular direction (see figure 3).
The source mixtures are obtained by summing the individual source components recorded by each microphone. The components are generated by convolving random utterances with measured impulse responses and are contaminated with additive white Gaussian noise (AWGN) at an SNR of 50 dB.
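The mixing procedure described above can be illustrated with a minimal sketch for a single microphone. This is not the script used to build the datasets: the function names (`convolve`, `add_awgn`, `mix_at_microphone`) are hypothetical, and the sketch assumes that the 50 dB SNR is measured separately on each reverberated component, which the description does not state explicitly.

```python
import math
import random


def convolve(signal, impulse_response):
    """Full linear convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += s * h
    return out


def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise scaled to the requested SNR (in dB)."""
    power = sum(v * v for v in signal) / len(signal)
    noise_std = math.sqrt(power / (10.0 ** (snr_db / 10.0)))
    return [v + rng.gauss(0.0, noise_std) for v in signal]


def mix_at_microphone(utterances, impulse_responses, snr_db=50.0, seed=0):
    """Return (mixture, components) for one microphone.

    Each component is an utterance convolved with its impulse response,
    plus noise (assumption: the SNR is applied per component).
    The mixture is the sample-wise sum of the components.
    """
    rng = random.Random(seed)
    components = [
        add_awgn(convolve(s, h), snr_db, rng)
        for s, h in zip(utterances, impulse_responses)
    ]
    length = max(len(c) for c in components)
    mixture = [
        sum(c[n] for c in components if n < len(c)) for n in range(length)
    ]
    return mixture, components
```

For a four-channel mixture, the same function would be called once per microphone with the impulse responses measured at that microphone.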
Figure2: graphical explanation of dataset1, setup1 (a different color means a different source)
Figure3: graphical explanation of dataset1, setup2 (a different color means a different source)
Dataset 2 ("a continuously moving active source overlapped with a source in a fixed or random location")
This dataset serves to evaluate source separation algorithms that are able to handle continuously varying mixing conditions. Unlike in dataset 1, the mixing condition of one of the sources changes continuously while that source is active. Therefore, BSS requires a continuous update of the parameters of the demixing system. Note that we restrict ourselves to the case where only one source is continuously moving, since the task is already made hard by the reverberation. Two different scenarios are simulated.
a) 1 moving source, 1 fixed source:
one source is continuously moving while the other is at a fixed location (see figure 4).
b) 1 moving source, 1 source at a random location:
one source is continuously moving while the other is in a static random location (which can vary over time) (see figure 5).
In both cases a) and b), the static source is located on a semi-circle in the angular space within (-90°;0°), at a distance d=1.0 m from the center of the array. The moving source is located in the angular space within (0°;90°), at a variable distance from the array (between about 0.5 m and 1.2 m).
The mixtures were obtained by summing the spatial images (responses) of the moving source with those of the static source (the latter is simulated as for dataset 1). Note that such a mixture is not fully realistic, because any moving object affects all source-microphone impulse responses. However, individual spatial images are required for a more accurate performance evaluation.
Figure4: graphical explanation of dataset2, scenario a) (a different color means a different source)
Figure5: graphical explanation of dataset2, scenario b) (a different color means a different source)
Development datasets
The development datasets are in the archive:
It includes the directories/sub-directories:
dataset 1
--1m_setup1 (array source-distance d=1.0m, setup 1)
--2m_setup2 (array source-distance d=2.0m, setup 2)
dataset 2
--1moving_1fixed (1 source is moving while the other is at a fixed location)
--1moving_1random (1 source is moving while the other is at a random location)
Each subdirectory includes files named with the following syntax (the .wav files have 4 channels):
stereo mixtures: x_<array spacing>.wav
separated source 1: source_image1_<array spacing>.wav
separated source 2: source_image2_<array spacing>.wav
segmentation file: segments_<array spacing>.mat
where <array spacing> can be either “2cm”, “8cm” or “18cm”. The segmentation file contains a struct array with the fields “start” and “end”. Each element of the array gives the time bounds of a segment where two sources are active at the same time. This is useful for a correct performance evaluation (e.g. through bss_eval), which should consider only the segments where two sources are active at the same time.
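As an illustration of how the segmentation information might be used, the sketch below keeps only the samples that fall inside the overlap segments before evaluation. The helper name `keep_overlap_segments` is hypothetical, and the sketch assumes that “start” and “end” have already been read from the .mat file (for example with `scipy.io.loadmat`) and converted to 0-based sample indices with an exclusive end. The page does not state the units or the indexing convention of these fields, so this should be checked against the data.

```python
def keep_overlap_segments(signal, segments):
    """Concatenate the parts of `signal` inside the given segments.

    signal:   sequence of samples for one channel
    segments: iterable of (start, end) pairs, assumed to be 0-based
              sample indices with `end` exclusive (an assumption, not
              something stated in the dataset description)
    """
    kept = []
    for start, end in segments:
        kept.extend(signal[start:end])
    return kept


# Toy example: ten samples, two overlap segments.
samples = list(range(10))
overlap_only = keep_overlap_segments(samples, [(2, 4), (7, 9)])
```

The same selection would be applied to the true source images and to the estimates, so that the evaluation criteria are computed on identical time spans.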
Test datasets
The test datasets are in the archive:
Files are organized as in the dev archive. The segmentation file and the individual source image files are not included. Note that the test data have been simulated with different instances of impulse responses (i.e. at different angular directions).
We propose the following two tasks:
- mono source signal estimation: estimation of the source signals
- stereo source signal estimation: estimation of the stereo microphone images (responses) of the separated sources
Each participant is asked to submit the estimation results of his/her algorithm for tasks 1 and/or 2 for all or part of the mixtures in the test datasets.
Files have to be organized in directories as for "dev" and "test" datasets, including in the folders the output files with the following syntax:
source 1: y_<array spacing>_src_1.wav (single channel .wav file)
source 2: y_<array spacing>_src_2.wav (single channel .wav file)
spatial image of source 1: y_<array spacing>_img_1.wav (4ch .wav file)
spatial image of source 2: y_<array spacing>_img_2.wav (4ch .wav file)
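The naming convention above can be generated programmatically. The following is only a convenience sketch: the helper `submission_filenames` and its dictionary layout are hypothetical, while the file name patterns and the three array spacings are taken from this page.

```python
VALID_SPACINGS = ("2cm", "8cm", "18cm")


def submission_filenames(spacing, n_sources=2):
    """Map ("src" | "img", source index) to the expected output file name."""
    if spacing not in VALID_SPACINGS:
        raise ValueError(f"unknown array spacing: {spacing!r}")
    names = {}
    for j in range(1, n_sources + 1):
        names[("src", j)] = f"y_{spacing}_src_{j}.wav"  # single channel
        names[("img", j)] = f"y_{spacing}_img_{j}.wav"  # 4 channels
    return names


names_8cm = submission_filenames("8cm")
```

Here `names_8cm[("img", 2)]` is the file name for the spatial image of source 2 with the 8 cm array.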
Evaluation criteria
Based on the evaluation method for source signal estimation in SiSEC2008, we propose to evaluate the estimated source signal(s) (and/or source images) via the criteria defined in the BSS_EVAL toolbox. These criteria allow an arbitrary filtering between the estimated source and the true source, and measure interference, distortion and artifacts separately. All source orderings are tested and the ordering leading to the best SIR is selected, which resolves the permutation ambiguity. Several evaluation tools can be found on the previous SiSEC2008 page.
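The ordering selection described above can be sketched as an exhaustive search over permutations. This is not the BSS_EVAL code: the function `best_ordering` is hypothetical, the SIR values are assumed to have been computed beforehand, and the sketch assumes that "best SIR" means the highest SIR averaged over sources.

```python
from itertools import permutations


def best_ordering(sir_db):
    """Pick the source ordering with the highest mean SIR.

    sir_db[i][j] is the SIR (in dB) obtained when estimate i is scored
    against true source j. Returns (ordering, mean_sir), where
    ordering[i] is the true source assigned to estimate i.
    """
    n = len(sir_db)
    best = None
    for perm in permutations(range(n)):
        mean_sir = sum(sir_db[i][perm[i]] for i in range(n)) / n
        if best is None or mean_sir > best[1]:
            best = (perm, mean_sir)
    return best


# Toy values: estimate 0 matches true source 1, and vice versa.
ordering, mean_sir = best_ordering([[-3.0, 12.0], [10.0, -5.0]])
```

With at most two simultaneously active sources only two orderings have to be compared, so the exhaustive search is inexpensive.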
Additional evaluation will be provided through the perceptual evaluation toolkit PEASS.
Potential Participants
J. Malek
Z. Koldovsky
B. Loesch
F. Nesta
S. Araki
W. Kellerman
Task proposed by the Audio Committee