Speech is essential for realistic role-playing, yet existing work on role-playing agents largely centers on text, leaving Speech Role-Playing Agents (SRPAs) underexplored and without systematic evaluation. We introduce SpeechRole, a unified framework for developing and assessing SRPAs. SpeechRole-Data contains 98 roles and 111k speech-to-speech conversations with rich timbre and prosodic variation, providing large-scale resources for training SRPAs. SpeechRole-Eval offers a multidimensional benchmark that directly evaluates generated speech, preserving paralinguistic cues and measuring interaction ability, speech expressiveness, and role-playing fidelity. Experiments show that end-to-end SRPAs such as GPT-4o Audio achieve strong fluency and naturalness, but remain limited in prosody consistency and emotion appropriateness. In contrast, current open-source end-to-end models exhibit substantial performance gaps across multiple evaluation dimensions. Cascaded and end-to-end systems achieve comparable results in interaction ability and role-playing fidelity, suggesting that these aspects are still largely influenced by the underlying text-based language models. We release all data, code, and evaluation tools at https://github.com/yuhui1038/SpeechRole.