Foley sound synthesis is the creation of realistic, diegetic sound effects for media such as film or radio. In this study, we build a neural Foley synthesizer that generates mono audio clips across seven predefined categories. Our approach introduces several enhancements to existing text-to-audio models, with the goal of enriching the diversity and acoustic characteristics of the generated Foley sounds. Specifically, we use a pre-trained encoder that retains acoustic and musical attributes in its intermediate embeddings, implement class conditioning to improve the separability of Foley classes in their intermediate representations, and devise a transformer-based architecture that optimizes self-attention computation on very large inputs without discarding valuable information. We then present intermediate results that surpass the baseline, discuss practical challenges encountered in pursuing optimal results, and outline potential directions for further research.