Neural audio synthesis methods can achieve high-fidelity, realistic sound generation by leveraging deep generative models. Such models typically rely on external, often discrete, labels as conditioning information to guide sound generation. However, controlling subtle variations in sound remains difficult without appropriate, descriptive labels, especially given a limited dataset. This paper proposes an implicit conditioning method for neural audio synthesis with generative adversarial networks that allows interpretable control over the acoustic features of synthesized sounds. Our technique creates a continuous conditioning space that enables timbre manipulation without relying on explicit labels. We further introduce an evaluation metric to probe controllability, and we demonstrate that our approach enables a degree of controlled variation of synthesized sound effects for both in-domain and cross-domain sounds.
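The abstract does not specify the architecture, but the general idea of a continuous conditioning space in a conditional GAN can be sketched as follows: instead of a discrete one-hot label, the generator receives a real-valued conditioning vector concatenated with the latent noise, so acoustic attributes can be varied smoothly. This is a minimal illustrative sketch, assuming a NumPy-style interface; the function name `generator_input` and the dimensions are hypothetical, not from the paper.

```python
import numpy as np

def generator_input(z, c):
    """Form the generator's input by concatenating latent noise z with a
    continuous conditioning vector c (hypothetical interface).

    With a discrete label, c would be a fixed one-hot vector; here c lives
    in a continuous space, so interpolating its entries smoothly varies
    the condition fed to the generator."""
    return np.concatenate([z, c])

rng = np.random.default_rng(0)
z = rng.standard_normal(128)       # latent noise vector
c = np.array([0.25, 0.80])         # e.g. two continuous timbre controls in [0, 1]
x = generator_input(z, c)
assert x.shape == (130,)           # generator input = noise dims + condition dims

# Smoothly sweeping one conditioning dimension (the kind of interpolation
# a continuous conditioning space makes possible):
sweep = [generator_input(z, np.array([t, 0.80])) for t in np.linspace(0.0, 1.0, 5)]
assert len(sweep) == 5
```

The design point is that continuity of `c` is what permits interpolation between conditions, something a discrete label set cannot express.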