The field of text-to-audio generation has seen significant advances, yet fine-grained control over the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a novel yet simple approach to generating sound effects with control over key acoustic parameters such as loudness, pitch, reverb, fade, brightness, noise, and duration, enabling creative applications in sound design and content creation. These parameters extend beyond traditional Digital Signal Processing (DSP) techniques: they incorporate learned representations that capture the subtleties of how sound characteristics are shaped in context, enabling richer and more nuanced control over the generated audio. Our approach is model-agnostic and is based on learning to disentangle audio semantics from acoustic features. It not only enhances the versatility and expressiveness of text-to-audio generation but also opens new avenues for creative audio production and sound design. Our objective and subjective evaluation results demonstrate the effectiveness of our approach in producing high-quality, customizable audio outputs that align closely with user specifications.