The goal of DCASE 2023 Challenge Task 7 is to generate diverse sound clips for Foley sound synthesis (FSS) using a "category-to-sound" approach. A "category" is expressed by a single index, while the corresponding "sound" covers many different sound examples. To generate diverse sounds for a given category, we adopt VITS, a text-to-speech (TTS) model based on variational inference. In addition, we apply several techniques from speech synthesis, including PhaseAug and Avocodo. Unlike TTS models, which generate short pronunciations from phonemes and a speaker identity, the category-to-sound problem requires generating diverse sounds from a category index alone. To compensate for this difference while maintaining consistency within each audio clip, we heavily modify the prior encoder to enhance its consistency with the posterior latent variables. This modification introduces an additional Gaussian in the prior encoder, which promotes variance within each category. With these modifications, we propose VIFS, variational inference for end-to-end Foley sound synthesis, which generates diverse, high-quality sounds.
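The abstract does not spell out how the prior encoder is built, so the following is only a minimal sketch of the general idea of a category-conditioned Gaussian prior, not the authors' implementation. It assumes a table with one mean and one log standard deviation per category index, draws latents by reparameterized sampling so that repeated draws for the same category differ, and computes the KL divergence between two diagonal Gaussians, the kind of term a variational model such as VITS uses to tie posterior and prior together. All names and sizes (`NUM_CATEGORIES`, `LATENT_DIM`, `make_prior_table`, `sample_prior`, `gaussian_kl`) are hypothetical, and the randomly filled table merely stands in for learned parameters.

```python
import math
import random

# Hypothetical toy sizes; the real model's dimensions are not given in the text.
NUM_CATEGORIES = 7
LATENT_DIM = 4


def make_prior_table(num_categories, latent_dim, seed=0):
    """Build a stand-in for a learned per-category (mean, log_std) table."""
    rng = random.Random(seed)
    table = []
    for _ in range(num_categories):
        mean = [rng.uniform(-1.0, 1.0) for _ in range(latent_dim)]
        log_std = [rng.uniform(-1.0, 0.0) for _ in range(latent_dim)]
        table.append((mean, log_std))
    return table


def sample_prior(table, category, rng):
    """Draw z = mean + exp(log_std) * eps, with eps ~ N(0, 1) per dimension."""
    mean, log_std = table[category]
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]


def gaussian_kl(mean_q, log_std_q, mean_p, log_std_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mean_q, log_std_q, mean_p, log_std_p):
        var_q = math.exp(2.0 * sq)
        var_p = math.exp(2.0 * sp)
        total += sp - sq + (var_q + (mq - mp) ** 2) / (2.0 * var_p) - 0.5
    return total


if __name__ == "__main__":
    table = make_prior_table(NUM_CATEGORIES, LATENT_DIM)
    rng = random.Random(1)
    # Two draws for the same category index give two different latents.
    print(sample_prior(table, 2, rng))
    print(sample_prior(table, 2, rng))
```

In a full model the sampled latent would be passed to a decoder that produces the waveform; that stage is omitted here because the abstract gives no details about it.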