This paper introduces FALL-E, a foley synthesis system, together with its training and inference strategies. FALL-E uses a cascaded approach comprising low-resolution spectrogram generation, spectrogram super-resolution, and a vocoder. We trained every sound-related model from scratch on our extensive datasets and used a pre-trained language model. We conditioned the model on dataset-specific text, enabling it to learn sound quality and recording environment from the text input. In addition, we used external language models to improve the text descriptions of our datasets and performed prompt engineering for quality, coherence, and diversity. FALL-E was evaluated with an objective measure and with listening tests in DCASE 2023 Challenge Task 7. The submission ranked second on average, with the best score for diversity, second place for audio quality, and third place for class fitness.
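The three-stage cascade and the dataset-specific text conditioning described above can be illustrated with a minimal data-flow sketch. This is not the FALL-E implementation: every function body is a placeholder standing in for a trained model, and all names (`CascadeConfig`, `build_prompt`, `generate_low_res`, `super_resolve`, `vocode`), the prompt format, the dataset tag, and the assumption that super-resolution upscales both the frequency and time axes by the same factor are hypothetical choices made only to show how the stages might connect.

```python
from dataclasses import dataclass
from typing import List

# A spectrogram is represented as rows of mel bins, each holding one value per frame.
Spectrogram = List[List[float]]


@dataclass
class CascadeConfig:
    # All values are illustrative assumptions, not settings from the paper.
    low_res_mels: int = 16
    low_res_frames: int = 32
    upscale: int = 4        # assumed upscaling factor on both axes
    hop_length: int = 256   # assumed audio samples produced per frame


def build_prompt(class_name: str, dataset_tag: str) -> str:
    """Append a dataset-specific descriptor to the class text (hypothetical format)."""
    return f"{class_name}, {dataset_tag}"


def generate_low_res(prompt: str, cfg: CascadeConfig) -> Spectrogram:
    """Placeholder for the text-conditioned low-resolution generator."""
    seed = sum(prompt.encode("utf-8")) % 97  # deterministic stand-in for sampling
    return [
        [((seed + m * 31 + t * 17) % 100) / 100.0 for t in range(cfg.low_res_frames)]
        for m in range(cfg.low_res_mels)
    ]


def super_resolve(spec: Spectrogram, cfg: CascadeConfig) -> Spectrogram:
    """Placeholder for super-resolution: nearest-neighbour upsampling on both axes."""
    out: Spectrogram = []
    for row in spec:
        wide = [v for v in row for _ in range(cfg.upscale)]
        out.extend(list(wide) for _ in range(cfg.upscale))
    return out


def vocode(spec: Spectrogram, cfg: CascadeConfig) -> List[float]:
    """Placeholder for the vocoder: emits hop_length samples per spectrogram frame."""
    n_mels, n_frames = len(spec), len(spec[0])
    wave: List[float] = []
    for t in range(n_frames):
        energy = sum(spec[m][t] for m in range(n_mels)) / n_mels
        wave.extend([energy] * cfg.hop_length)
    return wave


def synthesize(class_name: str, dataset_tag: str, cfg: CascadeConfig) -> List[float]:
    """Run the cascade: prompt -> low-res spectrogram -> high-res spectrogram -> waveform."""
    prompt = build_prompt(class_name, dataset_tag)
    low = generate_low_res(prompt, cfg)
    high = super_resolve(low, cfg)
    return vocode(high, cfg)
```

Under these assumptions, a call such as `synthesize("dog bark", "clean studio recording", CascadeConfig())` returns a list of samples whose length equals `low_res_frames * upscale * hop_length`. Splitting generation this way keeps the text-conditioned stage small, leaving fine spectral detail and waveform reconstruction to the later stages.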