History: Underdetermined speech and music mixtures

Comparison of version 10 with version 20

Underdetermined-speech and music mixtures


We propose to repeat the underdetermined-speech and music mixtures task of SiSEC2010 with fresh test data.


Test data


We have three datasets:
Download test.zip (22 MB) (former test data of SiSEC2008)
Download test2.zip (16 MB) (former test data of SiSEC2010)
Download test3.zip (8.6 MB) (fresh data for SiSEC2011: 3-channel mixtures of 4 speech sources)

test.zip

test.zip contains three types of stereo mixtures and we use the following two:
  • instantaneous mixtures (static sources scaled by positive gains)
  • live recordings (static sources played through loudspeakers in a meeting room, recorded one at a time by a pair of omnidirectional microphones and subsequently added together)
    • CAUTION: For SiSEC2011, we will NOT evaluate "synthetic convolutive mixtures" (static sources filtered by synthetic room impulse responses simulating a pair of omnidirectional microphones via the Roomsim toolbox).

The room dimensions are the same for synthetic convolutive mixtures and live recordings (4.45 x 3.55 x 2.5 m). The reverberation time is set to either 130 ms or 250 ms and the distance between the two microphones to either 5 cm or 1 m, resulting in 9 mixing conditions overall.

For each mixing condition, 6 mixture signals have been generated from different sets of source signals placed at different spatial positions:

  • 4 male speech sources
  • 4 female speech sources
  • 3 male speech sources
  • 3 female speech sources
  • 3 non-percussive music sources
  • 3 music sources including drums

The source directions of arrival vary between -60 degrees and +60 degrees with a minimal spacing of 15 degrees and the distances between the sources and the center of the microphone pair vary between 80 cm and 1.20 m.

The data consist of stereo WAV audio files, which can be imported in Matlab using the wavread command. These files are named test_<srcset>_<mixtype>_<reverb>_<spacing>_mix.wav, where <srcset> is a shorthand for the set of source signals, <mixtype> a shorthand for the mixture type, <reverb> the reverberation time and <spacing> the microphone spacing.

Licensing issue: These files are made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license. The authors are Glen Phillips, Mark Engelberg, Psycho Voyager, Nine Inch Nails and Ali Farka Touré for music source signals and Shoko Araki and Emmanuel Vincent for mixture signals.

test2.zip

test2.zip contains two types of stereo mixtures:
  • instantaneous mixtures (static sources scaled by positive and negative gains)
  • simulated recordings (static sources filtered by impulse responses recorded in a real room situation with loudspeakers and omnidirectional microphones)

The room dimensions for simulated recordings were 4.45 x 3.55 x 2.5 m, and the distance between each source and the center of the microphone pair was 1.20 m. The reverberation time for simulated recordings was set to either 130 ms or 380 ms and the distance between the two microphones to either 4 cm or 20 cm. Counting the instantaneous mixtures, 5 mixing conditions are therefore considered.

For each mixing condition, 6 mixture signals have been generated from different sets of source signals placed at different spatial positions:
  • 4 male speech sources
  • 4 female speech sources
  • 3 male speech sources
  • 3 female speech sources
  • 3 non-percussive music sources
  • 3 music sources including drums

The data consist of stereo WAV audio files, which can be imported in Matlab using the wavread command. These files are named test2_<srcset>_<mixtype>_<reverb>_<spacing>_mix.wav, where <srcset> is a shorthand for the set of source signals, <mixtype> a shorthand for the mixture type, <reverb> the reverberation time and <spacing> the microphone spacing.

Licensing issue: These files are made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license. The authors are Shannon Hurley, Nine Inch Nails, AlexQ (Alexander Lozupone), Mokamed, Carl Leth and Jim's Big Ego for music source signals and Hiroshi Sawada for mixture signals.


test3.zip

test3.zip contains two types of 3-ch mixtures:
  • instantaneous mixtures (static sources scaled by positive and negative gains)
  • simulated recordings (static sources filtered by impulse responses recorded in a real room situation with loudspeakers and omnidirectional microphones)

The room dimensions for simulated recordings were 4.45 x 3.55 x 2.5 m, and the distance between each source and the center of the microphone array was 1.0 m. The reverberation time for simulated recordings was set to either 130 ms or 380 ms and the microphone spacing to either 5 cm or 50 cm. Counting the instantaneous mixtures, 5 mixing conditions are therefore considered.

For each mixing condition, 2 mixture signals have been generated from different sets of source signals placed at different spatial positions:
  • 4 male speech sources
  • 4 female speech sources

The data consist of 3-channel WAV audio files, which can be imported in Matlab using the wavread command. These files are named test3_<srcset>_<mixtype>_<reverb>_<spacing>_mix.wav, where <srcset> is a shorthand for the set of source signals, <mixtype> a shorthand for the mixture type, <reverb> the reverberation time and <spacing> the microphone spacing.

Licensing issue: These files are made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license. The author is Shoko Araki for mixture signals.


Development data


Download dev1.zip (91 MB)
Download dev2.zip (47 MB)
(Both are the former development data of SiSEC2008 and SiSEC2010)
Download https://www.irisa.fr/metiss/members/dev3/download (47 MB) (fresh development data for 3-ch mixtures)

The data consist of Matlab MAT-files and WAV audio files, that can be imported in Matlab using the commands load and wavread respectively. These files are named as follows:

  • dev1_<srcset>_<mixtype>_<reverb>_src_<j>.wav: mono source signal
  • dev1_<srcset>_inst_matrix.mat: mixing matrix for instantaneous mixtures
  • dev1_<srcset>_<mixtype>_<reverb>_<spacing>_setup.txt: positions of the sources for convolutive mixtures
  • dev1_<srcset>_<mixtype>_<reverb>_<spacing>_filt.mat: mixing filter system for convolutive mixtures
  • dev1_<srcset>_<mixtype>_<reverb>_<spacing>_sim_<j>.wav: stereo contribution of a source signal to the two mixture channels
  • dev1_<srcset>_<mixtype>_<reverb>_<spacing>_mix.wav: stereo mixture signal

where <srcset> is a shorthand for the set of source signals, <mixtype> a shorthand for the mixture type, <reverb> the reverberation time, <spacing> the microphone spacing and <j> the source index.


All mixture signals and source image signals are 10 s long. Music source signals are 11 s long to avoid border effects within convolutive mixtures; the last 10 s are selected once the mixing system has been applied.

Note about dev3
The development data dev3 consists only of WAV audio files:
  • dev3_<srcset>_<mixtype>_<reverb>_src_<j>.wav: mono source signal
  • dev3_<srcset>_<mixtype>_<reverb>_<spacing>_sim_<j>.wav: 3-channel contribution of a source signal to the three mixture channels
  • dev3_<srcset>_<mixtype>_<reverb>_<spacing>_mix.wav: 3-channel mixture signal


Licensing issue: These files are made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license. The authors are Another Dreamer and Alex Q for music source signals and Hiroshi Sawada, Shoko Araki and Emmanuel Vincent for mixture signals.


Tasks

The source separation problem has been split into three tasks:
    1. source counting (estimate the number of sources)
    2. source signal estimation (estimate the mono source signals)
    3. source spatial image estimation (estimate the stereo contribution of each source to the two mixture channels)


Submissions


Each participant is asked to submit the results of his/her algorithm for tasks 2 and/or 3.

The results for task 1 may also be submitted.

In addition, each participant is asked to provide basic information about his/her algorithm (e.g. a bibliographical reference) and to declare its average running time, expressed in seconds per test excerpt and per GHz of CPU.

Submission instructions will be announced by July 2011.

Note that the submitted audio files will be made available on a website under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license.


Reference software


Please refer to the previous SiSEC2008 page.

Evaluation criteria


We propose to use the same evaluation criteria as in SiSEC 2010, except that the order of the estimated sources must be recovered.

The estimated speech signals in tasks 2 and 3 will be evaluated via the energy ratio criteria defined in the BSS_EVAL toolbox, which allow arbitrary filtering between the estimated source and the true source, and via the perceptually motivated criteria in the PEASS toolkit.

Performance will be compared to that of ideal binary masking as a benchmark (i.e. binary masks providing maximum SDR), computed over an STFT or a cochleagram.

The above performance criteria and benchmarks are respectively implemented in

Potential Participants

  • Dan Barry
  • Pau Bofill
  • Andreas Ehmann
  • Vikrham Gowreesunker
  • Matt Kleffner
  • Nikolaos Mitianoudis
  • Hiroshi Sawada
  • Emmanuel Vincent
  • Ming Xiao
  • Ron Weiss
  • Michael Mandel
  • Shoko Araki
  • Yosuke Izumi
  • Taesu Kim
  • Maximo Cobos
  • John Woodruff
  • Antonio Rebordao
  • Alexey Ozerov
  • Andrew Nesbit
  • Matthieu Puigt
  • Simon Arberet
  • Zaher Elchami
  • Ngoc Q. K. Duong
  • Zafar Rafii
  • Francesco Nesta

Task proposed by Audio Committee


Underdetermined-speech and music mixtures


We propose to repeat the underdetermined-speech and music mixtures task of SiSEC2010 with fresh test data.

Results


Test data


We have three datasets:
Download test.zip (22 MB) (former test data of SiSEC2008)
Download test2.zip (16 MB) (former test data of SiSEC2010)
Download test3.zip (8.6 MB) (fresh data for SiSEC2011: 3-channel mixtures of 4 speech sources)

test.zip

test.zip contains three types of stereo mixtures and we use the following two:
  • instantaneous mixtures (static sources scaled by positive gains)
  • live recordings (static sources played through loudspeakers in a meeting room, recorded one at a time by a pair of omnidirectional microphones and subsequently added together)
    • CAUTION: For SiSEC2011, we will NOT evaluate "synthetic convolutive mixtures" (static sources filtered by synthetic room impulse responses simulating a pair of omnidirectional microphones via the Roomsim toolbox).

The room dimensions are the same for synthetic convolutive mixtures and live recordings (4.45 x 3.55 x 2.5 m). The reverberation time is set to either 130 ms or 250 ms and the distance between the two microphones to either 5 cm or 1 m, resulting in 9 mixing conditions overall.
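One reading consistent with the stated total of 9 is a single instantaneous condition plus two convolutive mixture types, each crossed with two reverberation times and two microphone spacings. The Python sketch below enumerates that reading; the labels (inst, synthconv, liverec, 130ms, and so on) are illustrative assumptions and are not guaranteed to match the strings used in the archive.

```python
from itertools import product

# Illustrative labels only (assumptions, not taken from test.zip).
CONVOLUTIVE_TYPES = ["synthconv", "liverec"]
REVERBS = ["130ms", "250ms"]
SPACINGS = ["5cm", "1m"]


def mixing_conditions():
    """Return every (mixtype, reverb, spacing) combination under this reading."""
    # Instantaneous mixtures involve no room and no microphone spacing.
    conditions = [("inst", None, None)]
    conditions += list(product(CONVOLUTIVE_TYPES, REVERBS, SPACINGS))
    return conditions


def evaluated_conditions():
    """Drop the synthetic convolutive mixtures, which SiSEC2011 does not evaluate."""
    return [c for c in mixing_conditions() if c[0] != "synthconv"]


if __name__ == "__main__":
    print(len(mixing_conditions()))     # 1 + 2 * 2 * 2 combinations
    print(len(evaluated_conditions()))  # 1 + 1 * 2 * 2 combinations
```

Under the same reading, leaving out the synthetic convolutive mixtures keeps 5 of the 9 conditions.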

For each mixing condition, 6 mixture signals have been generated from different sets of source signals placed at different spatial positions:

  • 4 male speech sources
  • 4 female speech sources
  • 3 male speech sources
  • 3 female speech sources
  • 3 non-percussive music sources
  • 3 music sources including drums

The source directions of arrival vary between -60 degrees and +60 degrees with a minimal spacing of 15 degrees and the distances between the sources and the center of the microphone pair vary between 80 cm and 1.20 m.

The data consist of stereo WAV audio files, which can be imported in Matlab using the wavread command. These files are named test_<srcset>_<mixtype>_<reverb>_<spacing>_mix.wav, where <srcset> is a shorthand for the set of source signals, <mixtype> a shorthand for the mixture type, <reverb> the reverberation time and <spacing> the microphone spacing.
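The naming pattern can be split into its fields with a regular expression. The Python sketch below is one way to do so. It assumes that no field contains an underscore and that instantaneous mixtures may omit the <reverb> and <spacing> fields; the file names in the usage lines are hypothetical examples, not a listing of the archive.

```python
import re

# Assumption: <reverb>_<spacing> is optional (e.g. for instantaneous mixtures).
MIX_NAME = re.compile(
    r"^(?P<dataset>test\d*|dev\d+)"
    r"_(?P<srcset>[^_]+)"
    r"_(?P<mixtype>[^_]+)"
    r"(?:_(?P<reverb>[^_]+)_(?P<spacing>[^_]+))?"
    r"_mix\.wav$"
)


def parse_mix_name(filename):
    """Split a mixture file name into its fields; absent fields are None."""
    match = MIX_NAME.match(filename)
    if match is None:
        raise ValueError("not a mixture file name: %r" % filename)
    return match.groupdict()


if __name__ == "__main__":
    # Hypothetical file names used only to exercise the parser.
    print(parse_mix_name("test_female3_liverec_130ms_5cm_mix.wav"))
    print(parse_mix_name("test2_male4_inst_mix.wav"))
```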

Licensing issue: These files are made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license. The authors are Glen Phillips, Mark Engelberg, Psycho Voyager, Nine Inch Nails and Ali Farka Touré for music source signals and Shoko Araki and Emmanuel Vincent for mixture signals.

test2.zip

test2.zip contains two types of stereo mixtures:
  • instantaneous mixtures (static sources scaled by positive and negative gains)
  • simulated recordings (static sources filtered by impulse responses recorded in a real room situation with loudspeakers and omnidirectional microphones)

The room dimensions for simulated recordings were 4.45 x 3.55 x 2.5 m, and the distance between each source and the center of the microphone pair was 1.20 m. The reverberation time for simulated recordings was set to either 130 ms or 380 ms and the distance between the two microphones to either 4 cm or 20 cm. Counting the instantaneous mixtures, 5 mixing conditions are therefore considered.

For each mixing condition, 6 mixture signals have been generated from different sets of source signals placed at different spatial positions:
  • 4 male speech sources
  • 4 female speech sources
  • 3 male speech sources
  • 3 female speech sources
  • 3 non-percussive music sources
  • 3 music sources including drums

The data consist of stereo WAV audio files, which can be imported in Matlab using the wavread command. These files are named test2_<srcset>_<mixtype>_<reverb>_<spacing>_mix.wav, where <srcset> is a shorthand for the set of source signals, <mixtype> a shorthand for the mixture type, <reverb> the reverberation time and <spacing> the microphone spacing.

Licensing issue: These files are made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license. The authors are Shannon Hurley, Nine Inch Nails, AlexQ (Alexander Lozupone), Mokamed, Carl Leth and Jim's Big Ego for music source signals and Hiroshi Sawada for mixture signals.


test3.zip

test3.zip contains two types of 3-ch mixtures:
  • instantaneous mixtures (static sources scaled by positive and negative gains)
  • simulated recordings (static sources filtered by impulse responses recorded in a real room situation with loudspeakers and omnidirectional microphones)

The room dimensions for simulated recordings were 4.45 x 3.55 x 2.5 m, and the distance between each source and the center of the linear microphone array was 1.0 m. The reverberation time for simulated recordings was set to either 130 ms or 380 ms and the microphone spacing to either 5 cm or 50 cm. Counting the instantaneous mixtures, 5 mixing conditions are therefore considered.

For each mixing condition, 2 mixture signals have been generated from different sets of source signals placed at different spatial positions:
  • 4 male speech sources
  • 4 female speech sources

The data consist of 3-channel WAV audio files, which can be imported in Matlab using the wavread command. These files are named test3_<srcset>_<mixtype>_<reverb>_<spacing>_mix.wav, where <srcset> is a shorthand for the set of source signals, <mixtype> a shorthand for the mixture type, <reverb> the reverberation time and <spacing> the microphone spacing.
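For readers working outside Matlab, the Python sketch below reads a multichannel WAV file with the standard wave module and separates the interleaved samples into one list per channel. It assumes 16-bit PCM samples, which this page does not state, and the function name read_wav is hypothetical.

```python
import struct
import wave


def read_wav(path):
    """Return (sample_rate, channels); channels holds one sample list per channel."""
    with wave.open(path, "rb") as wav:
        n_channels = wav.getnchannels()
        if wav.getsampwidth() != 2:
            # Assumption: the files are 16-bit PCM; other widths are not handled.
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    # WAV stores little-endian samples interleaved frame by frame.
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    channels = [list(samples[c::n_channels]) for c in range(n_channels)]
    return rate, channels
```

Calling read_wav on a test3 mixture should give three channel lists of equal length, which is a quick check that a file was read as 3-channel rather than stereo.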

Licensing issue: These files are made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license. The author is Shoko Araki for mixture signals.


Development data


Download dev1.zip (91 MB)
Download dev2.zip (47 MB)
(Both are the former development data of SiSEC2008 and SiSEC2010)
Download dev3.zip (47 MB) (fresh development data for 3-ch mixtures)

The data consist of Matlab MAT-files and WAV audio files, that can be imported in Matlab using the commands load and wavread respectively. These files are named as follows:

  • dev1_<srcset>_<mixtype>_<reverb>_src_<j>.wav: mono source signal
  • dev1_<srcset>_inst_matrix.mat: mixing matrix for instantaneous mixtures
  • dev1_<srcset>_<mixtype>_<reverb>_<spacing>_setup.txt: positions of the sources for convolutive mixtures
  • dev1_<srcset>_<mixtype>_<reverb>_<spacing>_filt.mat: mixing filter system for convolutive mixtures
  • dev1_<srcset>_<mixtype>_<reverb>_<spacing>_sim_<j>.wav: stereo contribution of a source signal to the two mixture channels
  • dev1_<srcset>_<mixtype>_<reverb>_<spacing>_mix.wav: stereo mixture signal

where <srcset> is a shorthand for the set of source signals, <mixtype> a shorthand for the mixture type, <reverb> the reverberation time, <spacing> the microphone spacing and <j> the source index.


All mixture signals and source image signals are 10 s long. Music source signals are 11 s long to avoid border effects within convolutive mixtures; the last 10 s are selected once the mixing system has been applied.
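The trimming step can be illustrated with a short Python sketch that keeps the last 10 s of a signal. The 16 kHz sampling rate is an assumption taken from the submission format, and keep_last_seconds is a hypothetical helper, not part of any reference software.

```python
SAMPLE_RATE_HZ = 16000  # assumption: same rate as required for submitted files


def keep_last_seconds(samples, seconds=10, sample_rate=SAMPLE_RATE_HZ):
    """Return the final `seconds` of a signal given as a sequence of samples."""
    n_keep = int(seconds * sample_rate)
    if len(samples) < n_keep:
        raise ValueError("signal is shorter than the requested duration")
    # Discarding the leading samples removes the filter onset transient.
    return samples[len(samples) - n_keep:]


if __name__ == "__main__":
    filtered = list(range(11 * SAMPLE_RATE_HZ))  # stand-in for an 11 s signal
    trimmed = keep_last_seconds(filtered)
    print(len(trimmed) / SAMPLE_RATE_HZ)  # duration in seconds
```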

Note about dev1 and dev2
The development data dev1 and dev2 have the same setup as the test data in test.zip.
The development set corresponding to test2 is not provided.

Note about dev3
The development data dev3 consists only of WAV audio files:
  • dev3_<srcset>_<mixtype>_<reverb>_src_<j>.wav: mono source signal
  • dev3_<srcset>_<mixtype>_<reverb>_<spacing>_sim_<j>.wav: 3-channel contribution of a source signal to the three mixture channels
  • dev3_<srcset>_<mixtype>_<reverb>_<spacing>_mix.wav: 3-channel mixture signal


Licensing issue: These files are made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license. The authors are Another Dreamer and Alex Q for music source signals and Hiroshi Sawada, Shoko Araki and Emmanuel Vincent for mixture signals.


Tasks

The source separation problem has been split into three tasks:
    1. source counting (estimate the number of sources)
    2. source signal estimation (estimate the mono source signals)
    3. source spatial image estimation (estimate the stereo contribution of each source to the two mixture channels)


Submissions


Each participant is asked to submit the results of his/her algorithm for tasks 2 and/or 3.

The results for task 1 may also be submitted.

In addition, each participant is asked to provide basic information about his/her algorithm (e.g. a bibliographical reference) and to declare its average running time, expressed in seconds per test excerpt and per GHz of CPU.

How to submit

Each participant should make his/her results available online in the form of an archive named <YourName>_<dataset>.zip.

The included files must be named as follows:
  • <dataset>__<srcset>_<mixtype>_<reverb>_src_<j>.wav: Estimated source <j> for task 2. Mono WAV file sampled at 16 kHz.
  • <dataset>__<srcset>_<mixtype>_<reverb>_sim_<j>.wav: Estimated spatial image of source <j> for task 3. Stereo (3ch for test3/dev3) WAV file sampled at 16 kHz.
  • task1.txt: Estimated source numbers for task 1. The file's 1st column is the mixture label (dev1_<srcset>_<mixtype>_<reverb>_<spacing>_mix) and 2nd column is the estimated number of sources.

where <dataset> is one of test, test2, test3, dev2 or dev3, <srcset> is a shorthand for the set of source signals, <mixtype> a shorthand for the mixture type, <reverb> the reverberation time and <spacing> the microphone spacing.
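The Python sketch below builds file names according to the patterns listed above, reproducing them literally (including the double underscore after <dataset>). The helper names and the field values in the usage lines are hypothetical, and the single-space separator in task1.txt is an assumption, since the column delimiter is not specified.

```python
def source_estimate_name(dataset, srcset, mixtype, reverb, j):
    """File name for the estimate of source j (task 2)."""
    return f"{dataset}__{srcset}_{mixtype}_{reverb}_src_{j}.wav"


def spatial_image_name(dataset, srcset, mixtype, reverb, j):
    """File name for the estimated spatial image of source j (task 3)."""
    return f"{dataset}__{srcset}_{mixtype}_{reverb}_sim_{j}.wav"


def task1_lines(counts):
    """Lines of task1.txt: mixture label, then the estimated number of sources.

    `counts` maps a mixture label to an integer; a single space separates the
    two columns (assumption).
    """
    return [f"{label} {n}" for label, n in sorted(counts.items())]


if __name__ == "__main__":
    # Hypothetical field values used only for illustration.
    print(source_estimate_name("test3", "female4", "simrec", "130ms", 1))
    print(spatial_image_name("test3", "female4", "simrec", "130ms", 1))
    print("\n".join(task1_lines({"dev1_female4_simrec_130ms_5cm_mix": 4})))
```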

Each participant should then send an email to "araki.shoko (at) lab.ntt.co.jp" providing:
  • contact information (name, affiliation)
  • basic information about his/her algorithm, including its average running time (in seconds per test excerpt and per GHz of CPU) and a bibliographical reference if possible
  • the URL of the archive(s)

Note that the submitted audio files will be made available on a website under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license.


Reference software


Please refer to the previous SiSEC2008 page.

Evaluation criteria


We propose to use the same evaluation criteria as in SiSEC 2010, except that the order of the estimated sources must be recovered.

The estimated speech signals in tasks 2 and 3 will be evaluated via the energy ratio criteria defined in the BSS_EVAL toolbox, which allow arbitrary filtering between the estimated source and the true source, and via the perceptually motivated criteria in the PEASS toolkit.

Performance will be compared to that of ideal binary masking as a benchmark (i.e. binary masks providing maximum SDR), computed over an STFT or a cochleagram.
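To give a feel for what an energy ratio criterion measures, the Python sketch below computes a plain signal-to-error ratio in decibels between a true source and its estimate. It is a simplification and not the BSS_EVAL computation: it does not allow any filtering between estimate and reference and does not decompose the error into separate terms. The function name energy_ratio_db is hypothetical.

```python
import math


def energy_ratio_db(reference, estimate):
    """10*log10 of reference energy over error energy, for equal-length signals."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf   # perfect estimate
    if signal_energy == 0:
        return -math.inf  # silent reference with a non-silent estimate
    return 10 * math.log10(signal_energy / error_energy)


if __name__ == "__main__":
    true_source = [10.0, 0.0, -10.0, 0.0]
    estimate = [9.0, 0.0, -9.0, 0.0]
    print(energy_ratio_db(true_source, estimate))
```

A higher value means the estimate is closer to the reference; the toolbox criteria follow the same ratio-in-decibels convention but are computed differently.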

The above performance criteria and benchmarks are respectively implemented in

Potential Participants

  • Dan Barry
  • Pau Bofill
  • Andreas Ehmann
  • Vikrham Gowreesunker
  • Matt Kleffner
  • Nikolaos Mitianoudis
  • Hiroshi Sawada
  • Emmanuel Vincent
  • Ming Xiao
  • Ron Weiss
  • Michael Mandel
  • Shoko Araki
  • Yosuke Izumi
  • Taesu Kim
  • Maximo Cobos
  • John Woodruff
  • Antonio Rebordao
  • Alexey Ozerov
  • Andrew Nesbit
  • Matthieu Puigt
  • Simon Arberet
  • Zaher Elchami
  • Ngoc Q. K. Duong
  • Zafar Rafii
  • Francesco Nesta

Task proposed by Audio Committee


History

Date                             User    Version
Mon. 12 Dec. 2011, 06:11 CET     admin   20 (current)
Mon. 12 Dec. 2011, 06:11 CET     admin   19
Thu. 20 Oct. 2011, 02:54 CEST    admin   18
Fri. 9 Sept. 2011, 07:48 CEST    admin   17
Fri. 1 July 2011, 03:51 CEST     admin   16
Thu. 30 June 2011, 07:15 CEST    admin   15
Thu. 30 June 2011, 07:14 CEST    admin   14
Thu. 30 June 2011, 07:13 CEST    admin   13
Thu. 30 June 2011, 04:46 CEST    admin   12
Thu. 30 June 2011, 04:46 CEST    admin   11
Thu. 30 June 2011, 04:44 CEST    admin   10
Thu. 30 June 2011, 04:34 CEST    admin   9
