Recent advances in personalized speech generation have brought synthetic speech increasingly close to the realism of target speakers' recordings, yet multimodal speaker generation is still an emerging area. This paper introduces UniSpeaker, a unified approach to multimodality-driven speaker generation. Specifically, we propose a unified voice aggregator based on KV-Former and apply a soft contrastive loss to map diverse voice-description modalities into a shared voice space, ensuring that the generated voice aligns closely with the input descriptions. To evaluate multimodality-driven voice control, we build the first multimodality-based voice control (MVC) benchmark, which focuses on voice suitability, voice diversity, and speech quality. UniSpeaker is evaluated on five tasks using the MVC benchmark, and the experimental results demonstrate that it outperforms previous modality-specific models. Speech samples are available at \url{https://UniSpeaker.github.io}.
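To make the core mechanism concrete, the following is a minimal sketch of a KV-Former-style aggregator paired with a soft contrastive loss, assuming a cross-attention design in which learnable queries attend over modality embeddings (keys/values) and soft targets are derived from pairwise voice similarities. All names (`KVFormerAggregator`, `soft_contrastive_loss`), dimensions, and the target construction here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVFormerAggregator(nn.Module):
    """Hypothetical KV-Former-style aggregator: learnable queries cross-attend
    over embeddings from any voice-description modality (face, text, etc.)
    and produce a fixed-size embedding in a shared voice space."""
    def __init__(self, dim=256, num_queries=4, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, modality_tokens):
        # modality_tokens: (batch, seq, dim) output of a modality encoder
        b = modality_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries are fixed learnable vectors; keys/values come from the input,
        # hence the "KV" naming used in the abstract.
        out, _ = self.attn(q, modality_tokens, modality_tokens)
        # Pool the query outputs and L2-normalize to land on the shared voice space.
        return F.normalize(self.proj(out.mean(dim=1)), dim=-1)

def soft_contrastive_loss(desc_emb, voice_emb, soft_targets, tau=0.07):
    """Soft contrastive loss: instead of one-hot positives per row, match the
    softmax over description-to-voice similarities against soft targets
    (e.g. derived from speaker similarity between reference voices)."""
    logits = desc_emb @ voice_emb.t() / tau      # (batch, batch) similarity logits
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

# Toy usage with hypothetical face-encoder tokens and reference voice embeddings.
agg = KVFormerAggregator()
face_tokens = torch.randn(8, 16, 256)
voice_refs = F.normalize(torch.randn(8, 256), dim=-1)
# Soft labels from voice-voice similarity rather than strict identity matching.
targets = F.softmax(voice_refs @ voice_refs.t() / 0.1, dim=-1)
loss = soft_contrastive_loss(agg(face_tokens), voice_refs, targets)
```

The soft targets are the key design choice under these assumptions: voices that sound alike receive partial credit rather than being treated as hard negatives, which is what lets heterogeneous description modalities share one voice space.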