Despite significant advances in deep models for music generation, the use of these techniques remains restricted to expert users. Before they can be democratized among musicians, generative models must first provide expressive control over the generation, as this conditions the integration of deep generative models in creative workflows. In this paper, we tackle this issue by introducing a deep generative audio model that provides expressive and continuous descriptor-based control, while remaining lightweight enough to be embedded in a hardware synthesizer. We enforce the controllability of real-time generation by explicitly removing salient musical features from the latent space using an adversarial confusion criterion. User-specified features are then reintroduced as additional conditioning information, allowing for continuous control of the generation, akin to a synthesizer knob. We assess the performance of our method on a wide variety of sounds, including instrumental, percussive, and speech recordings, while providing both timbre and attribute transfer, allowing new ways of generating sounds.
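To make the two-step idea above concrete (remove attribute information from the latent code, then feed chosen attributes back in), the following is a minimal sketch in plain Python. It is not the model from this paper: the function names, the mean-squared-error losses, the weighting `lam`, and the use of concatenation for conditioning are all assumptions chosen for illustration.

```python
# Illustrative sketch only: names, losses, and the weight `lam` are assumptions,
# not taken from the paper.

def mse(xs, ys):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def discriminator_loss(predicted_attrs, true_attrs):
    """The discriminator tries to recover the attributes from the latent code."""
    return mse(predicted_attrs, true_attrs)


def encoder_decoder_loss(recon, target, predicted_attrs, true_attrs, lam=0.5):
    """Reconstruction error minus a weighted confusion term.

    Minimizing this value rewards the encoder when the discriminator fails,
    which pushes attribute information out of the latent code.
    """
    confusion = discriminator_loss(predicted_attrs, true_attrs)
    return mse(recon, target) - lam * confusion


def decoder_input(latent, control_attrs):
    """Reintroduce user-chosen attributes by concatenating them to the latent."""
    return list(latent) + list(control_attrs)
```

In this sketch, the discriminator and the encoder-decoder pair would be trained in alternation with opposing objectives, and at generation time `decoder_input(latent, [knob_value])` plays the role of the synthesizer knob: the same latent code can be decoded under different attribute values.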