Although the design and application of audio effects are well understood, the inverse problem of removing these effects is significantly more challenging and far less studied. Recently, deep learning has been applied to audio effect removal; however, existing approaches have focused on narrow formulations that consider only one effect or source type at a time. In realistic scenarios, multiple effects are applied to varying source content. This motivates a more general task, which we refer to as general-purpose audio effect removal. We developed a dataset for this task using five audio effects across four different sources and used it to train and evaluate a set of existing architectures. We found that no single model performed optimally on all effect types and sources. To address this, we introduced RemFX, an approach designed to mirror the compositionality of applied effects. We first trained a set of the best-performing effect-specific removal models and then leveraged an audio effect classification model to dynamically construct a graph of our models at inference. We found that our approach outperforms single-model baselines, although examples with many effects present remain challenging.
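The compositional inference step described above can be illustrated with a minimal sketch. It is not the RemFX implementation: the function names, the 0.5 detection threshold, the choice to apply removers as a simple sequential chain in registration order, and the toy "gain" and "offset" effects are all assumptions made for illustration. In practice the classifier and the removers would be trained neural networks.

```python
from typing import Callable, Dict, List, Tuple

Audio = List[float]
Remover = Callable[[Audio], Audio]
Classifier = Callable[[Audio], Dict[str, float]]


def build_removal_chain(
    probs: Dict[str, float],
    removers: Dict[str, Remover],
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> List[str]:
    """Select the effects judged present, in the order removers were registered."""
    return [name for name in removers if probs.get(name, 0.0) >= threshold]


def remove_effects(
    audio: Audio,
    classifier: Classifier,
    removers: Dict[str, Remover],
    threshold: float = 0.5,
) -> Tuple[Audio, List[str]]:
    """Classify the input, then apply only the removers for detected effects."""
    probs = classifier(audio)
    chain = build_removal_chain(probs, removers, threshold)
    restored = list(audio)
    for name in chain:
        restored = removers[name](restored)
    return restored, chain


# Toy stand-ins (hypothetical): each "remover" undoes a trivial effect.
def toy_classifier(audio: Audio) -> Dict[str, float]:
    # A real classifier would predict these probabilities from the audio.
    return {"offset": 0.1, "gain": 0.9}


toy_removers: Dict[str, Remover] = {
    "offset": lambda a: [x - 0.5 for x in a],
    "gain": lambda a: [x / 2.0 for x in a],
}

restored, chain = remove_effects([2.0, -2.0], toy_classifier, toy_removers)
print(chain, restored)
```

In this sketch only the "gain" remover runs, because the toy classifier assigns "offset" a probability below the threshold. Skipping removers for effects that are not detected is the point of building the chain at inference time rather than always running every model.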