We adapt the architectures of previous audio manipulation and generation neural networks to the task of real-time any-to-one voice conversion. Our resulting model, LLVC ($\textbf{L}$ow-latency $\textbf{L}$ow-resource $\textbf{V}$oice $\textbf{C}$onversion), has a latency of under 20 ms at a sampling rate of 16 kHz and runs nearly 2.8x faster than real time on a consumer CPU. LLVC uses both a generative adversarial architecture and knowledge distillation to attain this performance. To our knowledge, LLVC achieves both the lowest resource usage and the lowest latency of any open-source voice conversion model. We provide open-source samples, code, and pretrained model weights at https://github.com/KoeAI/LLVC.
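To make the two headline figures concrete, the following is a minimal sketch of the arithmetic behind them. It assumes the speed figure is the ratio of audio duration to processing time; the function names are hypothetical and are not taken from the LLVC codebase.

```python
def samples_in_window(sample_rate_hz: int, window_ms: float) -> int:
    """Number of audio samples spanned by a window of the given length."""
    return round(sample_rate_hz * window_ms / 1000)


def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Speed relative to real time: values above 1 mean faster than real time.

    Assumption: defined here as audio duration divided by processing time.
    """
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, speedup: float) -> float:
    """Time needed to process the audio at the given speed relative to real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


if __name__ == "__main__":
    # A 20 ms window at a 16 kHz sampling rate
    print(samples_in_window(16_000, 20))
    # Seconds of compute per second of audio at 2.8x real time
    print(processing_time(1.0, 2.8))
```

Under this definition, a model running 2.8x faster than real time spends roughly 0.36 s of compute per second of audio, and a 20 ms window at 16 kHz spans 320 samples.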