This work introduces Text2FX, a method that leverages CLAP embeddings and differentiable digital signal processing to control audio effects, such as equalization and reverberation, using open-vocabulary natural language prompts (e.g., "make this sound in-your-face and bold"). Text2FX operates without retraining any models, relying instead on single-instance optimization within the existing embedding space. We show that CLAP encodes valuable information for controlling audio effects and propose two optimization approaches using CLAP to map text to audio effect parameters. While we demonstrate with CLAP, this approach is applicable to any shared text-audio embedding space. Similarly, while we demonstrate with equalization and reverberation, any differentiable audio effect may be controlled. We conduct a listener study with diverse text prompts and source audio to evaluate the quality and alignment of these methods with human perception.
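To make the idea of single-instance optimization concrete, the following is a minimal sketch under loud assumptions, not the Text2FX implementation. A toy two-band equalizer stands in for the differentiable effect, a hypothetical band-energy encoder (`toy_audio_embedding`) stands in for the CLAP audio encoder, a hand-picked unit vector stands in for the CLAP text embedding of a prompt such as "bright", and central finite differences stand in for automatic differentiation through the signal chain. All function names, the sample rate, and the hyperparameters are illustrative choices. What the sketch does share with the approach described above is the loop structure: no model is trained, and only the effect parameters for one audio clip are updated to pull the processed audio's embedding toward the text embedding.

```python
import numpy as np

SR = 16_000          # sample rate in Hz (assumed for this toy example)
SPLIT_HZ = 1_000.0   # boundary between the "low" and "high" band (assumed)


def two_band_eq(audio, gains_db):
    """Toy EQ: scale the spectrum below/above SPLIT_HZ by separate gains in dB."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / SR)
    low_gain = 10.0 ** (gains_db[0] / 20.0)
    high_gain = 10.0 ** (gains_db[1] / 20.0)
    gain = np.where(freqs < SPLIT_HZ, low_gain, high_gain)
    return np.fft.irfft(spectrum * gain, n=len(audio))


def toy_audio_embedding(audio):
    """Hypothetical stand-in for a CLAP audio encoder: unit-norm band energies."""
    power = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / SR)
    emb = np.array([power[freqs < SPLIT_HZ].sum(), power[freqs >= SPLIT_HZ].sum()])
    return emb / np.linalg.norm(emb)


def loss(gains_db, audio, text_emb):
    """1 - cosine similarity between processed-audio and text embeddings."""
    emb = toy_audio_embedding(two_band_eq(audio, gains_db))
    return 1.0 - float(emb @ text_emb)  # both vectors are unit norm


def numerical_grad(f, params, eps=1e-3):
    """Central finite differences; a real system would use autodiff instead."""
    grad = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (f(params + step) - f(params - step)) / (2.0 * eps)
    return grad


def optimize_fx(audio, text_emb, steps=200, lr=20.0):
    """Gradient descent on the effect parameters for a single audio clip."""
    gains_db = np.zeros(2)  # start from a flat EQ
    for _ in range(steps):
        grad = numerical_grad(lambda p: loss(p, audio, text_emb), gains_db)
        gains_db = gains_db - lr * grad
    return gains_db


# One second of audio with equal-amplitude 200 Hz and 4 kHz components.
t = np.arange(SR) / SR
audio = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 4000 * t)

# Stand-in "text embedding": asks for four times more high-band than low-band energy.
text_emb = np.array([1.0, 4.0])
text_emb = text_emb / np.linalg.norm(text_emb)

gains = optimize_fx(audio, text_emb)
print("low/high gains in dB:", gains)
print("final loss:", loss(gains, audio, text_emb))
```

In this toy setup the target asks for a 4:1 high-to-low energy ratio, so the optimizer should settle with the high band roughly 6 dB above the low band. In a realistic setting, the two stand-in pieces would presumably be replaced by a pretrained text-audio embedding model and by effects implemented in an autodiff framework, so that gradients flow from the embedding loss back to the effect parameters directly.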