Conventional studies on caption-based environmental sound separation and synthesis have trained models on datasets of multiple-source sounds paired with captions. However, for multiple-source sounds it is difficult to collect captions that describe each individual source in detail, such as the number of times a sound occurs or its timbre. As a result, models trained on conventional captioned sound datasets struggle to extract only a single-source target sound. In this work, we construct CAPTDURE, a dataset of single-source sounds with captions that can be used in various tasks such as environmental sound separation and synthesis. The dataset consists of 1,044 sounds and 4,902 captions. We evaluate environmental sound extraction using our dataset, and the experimental results show that captions for single-source sounds are effective for extracting only the single-source target sound from a mixture.