The creation of high-quality multimodal datasets remains fundamental to advancing role-playing capabilities in large language models (LLMs). While existing work focuses predominantly on text-based persona simulation, Audio Role-Playing (ARP) presents unique challenges because semantic content and vocal characteristics must be aligned with each other. To address this gap, we propose AudioRole, a meticulously curated dataset drawn from 13 TV series, spanning 1K+ hours and 1M+ character-grounded dialogues, that provides synchronized audio-text pairs annotated with speaker identities and contextual metadata. To demonstrate the effectiveness of the dataset, we also introduce ARP-Eval, a dual-aspect evaluation framework that assesses both response quality and role fidelity. Empirical validation shows that GLM-4-Voice trained on AudioRole (which we call ARP-Model) achieves an average Acoustic Personalization score of 0.31, significantly outperforming both the original GLM-4-Voice and the more powerful MiniCPM-O-2.6, which specifically supports role-playing in one-shot scenarios. The ARP-Model also achieves a Content Personalization score of 0.36, surpassing the untrained original model by about 38% and matching MiniCPM-O-2.6. AudioRole comprises dialogues from over 115 main characters, 6 trained ARP-Models that role-play different characters, and evaluation protocols. Together, these provide an essential resource for advancing audio-grounded role-playing research.