Controllable timbre synthesis has been studied for several decades, and deep neural networks have proven the most successful approach to it. Deep generative models such as Variational Autoencoders (VAEs) can learn a high-level representation of audio while providing a structured latent space. Despite these advantages, such latent spaces are often difficult to interpret in terms of human perception. To address this limitation and improve control over timbre generation, we propose a regularized VAE latent space that incorporates timbre descriptors. We also propose a more compact representation of sound based on its harmonic content, which reduces the dimensionality of the latent space.