This paper explores language-agnostic speaker replication, a novel task that seeks to replicate a speaker's voice irrespective of the language being spoken. To this end, we introduce a multi-level attention aggregation approach that systematically probes and amplifies speaker-specific attributes in a hierarchical manner. Through rigorous evaluations across a wide range of scenarios, covering seen and unseen speakers conversing in seen and unseen languages, we establish that the proposed model achieves substantial speaker similarity and generalizes to out-of-domain (OOD) cases.