Deep learning has become a standard approach for modeling audio effects, yet strictly black-box modeling remains problematic for time-varying systems. Unlike time-invariant effects, devices with internal modulation typically require the control signal to be recorded or extracted before training, so that the model output is time-aligned with the target as standard loss functions demand. This paper introduces a Generative Adversarial Network (GAN) framework that models such effects using only input-output audio recordings, with no modulation signal extraction. We propose a convolutional-recurrent architecture trained with a two-stage strategy: an initial adversarial phase lets the model learn the distribution of the modulation behavior without strict phase constraints, and a subsequent supervised fine-tuning phase uses a State Prediction Network (SPN) to estimate the initial internal states needed to synchronize the model with the target. We also develop a new metric based on chirp-train signals to quantify modulation accuracy. Experiments modeling a vintage hardware phaser demonstrate the method's ability to capture time-varying dynamics in a fully black-box setting.
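The abstract does not specify how the chirp-train signals are constructed, so the following is only a minimal sketch of one plausible test signal: a sequence of identical short linear sine sweeps played back to back. The idea, under this assumption, is that each short sweep probes the effect's frequency response at a different moment of the modulation cycle. The function names `linear_chirp` and `chirp_train`, the sweep range, the chirp duration, and the sample rate are all illustrative choices, not values taken from the paper.

```python
import math


def linear_chirp(f0, f1, duration, sample_rate):
    """Return one linear sine sweep from f0 to f1 Hz as a list of floats."""
    n = int(round(duration * sample_rate))
    k = (f1 - f0) / duration  # sweep rate in Hz per second
    samples = []
    for i in range(n):
        t = i / sample_rate
        # Phase is the integral of the instantaneous frequency f0 + k*t.
        phase = 2.0 * math.pi * (f0 * t + 0.5 * k * t * t)
        samples.append(math.sin(phase))
    return samples


def chirp_train(num_chirps, f0=20.0, f1=20000.0, duration=0.01,
                sample_rate=44100):
    """Concatenate num_chirps identical chirps (all parameters are assumed)."""
    chirp = linear_chirp(f0, f1, duration, sample_rate)
    return chirp * num_chirps


# One second of test signal: 100 chirps of 10 ms each at 44.1 kHz.
signal = chirp_train(num_chirps=100)
```

Passing such a signal through both the target device and the trained model, then comparing the per-chirp responses, would be one way to compare how the two modulations evolve over time. The metric actually used in the paper may differ.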