Musicians and audio engineers sculpt and transform their sounds by connecting multiple processors, forming an audio processing graph. However, most deep-learning methods overlook this real-world practice and assume fixed graph settings. To bridge this gap, we develop a system that reconstructs the entire graph from a given reference audio. We first generate a realistic graph-reference pair dataset and train a simple blind estimation system composed of a convolutional reference encoder and a transformer-based graph decoder. We apply our model to singing voice effects and drum mixing estimation tasks. Evaluation results show that our method can reconstruct complex signal routings, including multi-band processing and sidechaining.
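To make the notion of an audio processing graph concrete, the following is a minimal sketch of one possible representation, not the representation used in this work. The node names, processor types, and the `(source, destination, input_port)` edge format are all hypothetical and chosen only for illustration. The example contains a two-band split (multi-band processing) and a compressor whose sidechain input is driven by a second source, and it derives an order in which the processors could be evaluated.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical graph: each node is a processor, keyed by name, with a type label.
nodes = {
    "in_vocal": "input",
    "in_kick": "input",
    "split": "crossover",      # splits the vocal into a low and a high band
    "low_comp": "compressor",  # processes the low band only
    "high_eq": "equalizer",    # processes the high band only
    "sum": "mix",              # recombines the two bands
    "duck": "compressor",      # gain driven by the kick (sidechain)
    "out": "output",
}

# Hypothetical edges: (source, destination, destination input port).
edges = [
    ("in_vocal", "split", "main"),
    ("split", "low_comp", "main"),
    ("split", "high_eq", "main"),
    ("low_comp", "sum", "main"),
    ("high_eq", "sum", "main"),
    ("sum", "duck", "main"),
    ("in_kick", "duck", "sidechain"),
    ("duck", "out", "main"),
]


def processing_order(nodes, edges):
    """Return the node names so that every processor follows all of its inputs."""
    predecessors = {name: set() for name in nodes}
    for src, dst, _port in edges:
        predecessors[dst].add(src)
    # static_order() raises graphlib.CycleError if the routing contains a cycle.
    return list(TopologicalSorter(predecessors).static_order())


order = processing_order(nodes, edges)
print(order)
```

Under this assumed representation, reconstructing a graph from reference audio would amount to predicting the node types, their parameters (omitted here), and the edges.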