Equipping robotic faces with singing capabilities is crucial for empathetic Human-Robot Interaction. However, existing research on driving robotic faces focuses primarily on conversation or on mimicking static expressions, and struggles to meet singing's demands for continuous, coherent emotional expression. To address this, we propose a novel avatar-driven framework for appealing robotic singing. We first leverage portrait video generation models, which embed extensive human priors, to synthesize vivid singing avatars that provide reliable expression and emotion guidance. The avatar's facial features are then transferred to the robot via semantic-oriented mapping functions that span a wide expression space. Furthermore, to quantitatively evaluate the emotional richness of robotic singing, we propose the Emotion Dynamic Range (EDR) metric, which measures emotional breadth within the Valence-Arousal space and reveals that a broad emotional spectrum is crucial for appealing performances. Comprehensive experiments demonstrate that our method achieves rich emotional expression while maintaining lip-audio synchronization, significantly outperforming existing approaches.
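To make the transfer step concrete, the sketch below shows one way a semantic-oriented mapping could operate, assuming the avatar is parameterized by blendshape-style coefficients in [0, 1] and the robot face by servo angles. The channel names, calibration endpoints, and per-channel linear form are illustrative assumptions, not the paper's actual mapping functions.

```python
# A minimal sketch of a semantic-oriented expression mapping. All channel
# names and calibration values below are hypothetical, for illustration only.
import numpy as np

# Hypothetical calibration table: for each semantic channel, the servo angle
# (degrees) at coefficient 0 (neutral pose) and at coefficient 1 (extreme pose).
CALIBRATION = {
    "jaw_open":    (0.0, 35.0),  # mouth opening, carries lip-audio sync
    "mouth_smile": (0.0, 25.0),
    "brow_raise":  (0.0, 20.0),
    "eye_close":   (0.0, 30.0),
}

def map_expression(avatar_coeffs: dict[str, float]) -> dict[str, float]:
    """Map avatar blendshape coefficients to robot servo angles.

    Each semantic channel is transferred independently by interpolating
    between its calibrated neutral and extreme poses, so the robot can
    cover the full expression range its hardware allows.
    """
    commands = {}
    for channel, (lo, hi) in CALIBRATION.items():
        c = float(np.clip(avatar_coeffs.get(channel, 0.0), 0.0, 1.0))
        commands[channel] = lo + c * (hi - lo)
    return commands

# Example frame: the avatar sings an open vowel with slightly raised brows.
frame = {"jaw_open": 0.8, "brow_raise": 0.5, "mouth_smile": 0.2}
print(map_expression(frame))
```

Per-channel monotonic functions of this kind keep the mapping interpretable: each avatar semantic drives exactly one robot degree of freedom, which makes calibration against the hardware's limits straightforward.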
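The abstract states only that EDR measures emotional breadth within the Valence-Arousal space. The sketch below instantiates that idea as the convex-hull area of a per-frame (valence, arousal) trajectory, assuming an emotion recognizer has already produced the trajectory; this is one plausible reading, not the paper's definition.

```python
# A minimal sketch of an Emotion Dynamic Range computation over a
# (valence, arousal) trajectory in [-1, 1]^2. The convex-hull-area form is
# an assumption for illustration, not the paper's stated formula.
import numpy as np
from scipy.spatial import ConvexHull, QhullError

def emotion_dynamic_range(va_trajectory: np.ndarray) -> float:
    """Return the convex-hull area of a V-A trajectory.

    va_trajectory: (T, 2) array of per-frame (valence, arousal) values.
    A larger area means the performance sweeps a broader emotional spectrum.
    """
    points = np.asarray(va_trajectory, dtype=float)
    if len(points) < 3:
        return 0.0  # fewer than 3 points cannot span an area
    try:
        return ConvexHull(points).volume  # for 2-D input, .volume is the area
    except QhullError:
        return 0.0  # degenerate (constant or collinear) trajectory

# Example: a flat performance vs. one that sweeps across the V-A plane.
t = np.linspace(0.0, 2.0 * np.pi, 100)
flat = np.tile([0.1, 0.0], (100, 1))                      # emotionally static
rich = 0.8 * np.column_stack([np.sin(t), np.cos(t)])      # broad V-A sweep
print(emotion_dynamic_range(flat), emotion_dynamic_range(rich))
```

Under this reading, the emotionally static trajectory scores zero while the sweeping one scores near the maximum, matching the abstract's claim that a broad emotional spectrum distinguishes appealing performances.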