Deep learning approaches for black-box modelling of audio effects have shown promise. However, the majority of existing work focuses on nonlinear effects with behaviour on relatively short time scales, such as guitar amplifiers and distortion. While recurrent and convolutional architectures can theoretically be extended to capture behaviour at longer time scales, we show that simply scaling the width, depth, or dilation factor of existing architectures does not result in satisfactory performance when modelling audio effects such as fuzz and dynamic range compression. To address this, we propose integrating time-varying feature-wise linear modulation into existing temporal convolutional backbones, an approach that enables learnable adaptation of the intermediate activations over time. We demonstrate that our approach more accurately captures long-range dependencies for a range of fuzz and compressor implementations, as measured by both time-domain and frequency-domain metrics. We provide sound examples, source code, and pretrained models to facilitate reproducibility.
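To make the proposed mechanism concrete, the following is a minimal NumPy sketch of time-varying feature-wise linear modulation applied to the output of one causal dilated convolution. It is an illustration under stated assumptions, not the released implementation: the block-wise mean-absolute pooling, the plain tanh recurrent cell that produces the modulation parameters, and all names (`causal_dilated_conv`, `film_params`, `w_in`, `w_rec`, `w_out`, `BLOCK`) are hypothetical choices made for this example, and the weights are random rather than trained.

```python
import numpy as np


def causal_dilated_conv(x, w, dilation):
    """Causal dilated 1-D convolution.

    x: (C_in, T) input, w: (C_out, C_in, K) kernel. Returns (C_out, T).
    """
    c_out, _, k = w.shape
    t = x.shape[1]
    pad = (k - 1) * dilation
    xp = np.pad(x, ((0, 0), (pad, 0)))  # left-pad only, so no future samples are used
    y = np.zeros((c_out, t))
    for j in range(k):
        # Tap j sees the input delayed by (k - 1 - j) * dilation samples.
        y += w[:, :, j] @ xp[:, j * dilation : j * dilation + t]
    return y


def film_params(h, block_size, w_in, w_rec, w_out):
    """Compute per-channel scale (gamma) and shift (beta) that change once per block.

    h: (C, T) activations; T is assumed to be a multiple of block_size.
    The recurrent cell below is an assumed stand-in for a learned modulator.
    """
    c, t = h.shape
    n_blocks = t // block_size
    # Summarise each block of activations by its mean absolute value: (C, n_blocks).
    pooled = np.abs(h).reshape(c, n_blocks, block_size).mean(axis=2)
    state = np.zeros(w_rec.shape[0])
    params = []
    for b in range(n_blocks):
        # The recurrent state carries information across blocks (long time scales).
        state = np.tanh(w_in @ pooled[:, b] + w_rec @ state)
        params.append(w_out @ state)  # (2C,): first C are gamma, last C are beta
    params = np.stack(params, axis=1)  # (2C, n_blocks)
    # Hold each block's parameters constant over its block_size samples.
    gamma = np.repeat(params[:c], block_size, axis=1)
    beta = np.repeat(params[c:], block_size, axis=1)
    return gamma, beta


rng = np.random.default_rng(0)
C, T, H, BLOCK = 4, 32, 8, 8  # channels, samples, modulator state size, block size

x = rng.standard_normal((1, T))  # mono input signal
w_conv = 0.5 * rng.standard_normal((C, 1, 3))
w_in = 0.5 * rng.standard_normal((H, C))
w_rec = 0.5 * rng.standard_normal((H, H))
w_out = 0.5 * rng.standard_normal((2 * C, H))

h = causal_dilated_conv(x, w_conv, dilation=2)
gamma, beta = film_params(h, BLOCK, w_in, w_rec, w_out)
y = np.tanh(gamma * h + beta)  # modulate the activations, then apply the nonlinearity
print(y.shape)  # (4, 32)
```

In this sketch the convolution handles short-range structure, while the modulation parameters are updated once per block from a recurrent state, which is one way the intermediate activations can adapt over time scales much longer than the convolutional receptive field.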