A virtual world is emerging in which digital humans are designed to be indistinguishable from real ones. Giving them audio capabilities is crucial, since the voice conveys extensive personal characteristics. We aim to create a controllable, audio-form virtual singer; however, modeling and controlling every factor of the singing voice (such as timbre, tempo, pitch, and lyrics) in a supervised manner is extremely difficult, because accurately labeling all of this information requires enormous manual labor. In this paper, we propose a framework that digitizes a person's voice in a fully unsupervised manner by simply "listening" to clean voice recordings of any content, and that can predict singing voices even from speaking recordings alone. We develop a variational auto-encoder (VAE) based framework that leverages a set of pre-trained models to encode the audio into hidden embeddings representing different factors of the singing voice, and then decodes these embeddings into raw audio. Manipulating the hidden embeddings of individual factors controls the resulting singing voice, and interpolating between timbres generates new virtual singers. Evaluations across different types of experiments demonstrate the effectiveness of the proposed method. The method is the critical technique behind the AI choir that empowered the human-AI symbiotic orchestra in Hong Kong in July 2022.
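The timbre interpolation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each singer's timbre is a fixed-length embedding vector and that new singers are obtained by plain linear interpolation; the abstract does not specify the interpolation scheme, so both the function name `interpolate_timbre` and the linear blend are hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def interpolate_timbre(
    timbre_a: Sequence[float],
    timbre_b: Sequence[float],
    alpha: float,
) -> List[float]:
    """Blend two timbre embeddings (hypothetical helper, not from the paper).

    alpha = 0.0 returns singer A's timbre, alpha = 1.0 returns singer B's,
    and values in between give a "new" virtual singer.
    Linear interpolation is an assumption made for this sketch.
    """
    if len(timbre_a) != len(timbre_b):
        raise ValueError("timbre embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Element-wise convex combination of the two embeddings.
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(timbre_a, timbre_b)]


# Toy 2-dimensional embeddings; real timbre embeddings would be much larger.
singer_a = [0.0, 2.0]
singer_b = [1.0, 4.0]
new_singer = interpolate_timbre(singer_a, singer_b, 0.5)
```

In the framework described above, the blended vector would replace the timbre embedding fed to the decoder while the embeddings for the other factors (pitch, tempo, lyrics) stay unchanged.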