AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models

The creation of high-quality multimodal datasets remains fundamental to advancing role-playing capabilities in large language models (LLMs). While existing work predominantly focuses on text-based persona simulation, Audio Role-Playing (ARP) presents unique challenges because semantic content and vocal characteristics must be kept in synchronized alignment. To address this gap, we propose AudioRole, a meticulously curated dataset drawn from 13 TV series, spanning 1K+ hours and 1M+ character-grounded dialogues, that provides synchronized audio-text pairs annotated with speaker identities and contextual metadata. To demonstrate the effectiveness of the dataset, we also introduce ARP-Eval, a dual-aspect evaluation framework that assesses both response quality and role fidelity. Empirical validation shows that GLM-4-Voice trained on AudioRole (which we call ARP-Model) achieves an average Acoustic Personalization score of 0.31, significantly outperforming both the original GLM-4-Voice and the more powerful MiniCPM-O-2.6, which specifically supports role-playing in one-shot scenarios. The ARP-Model also achieves a Content Personalization score of 0.36, surpassing the untrained original model by about 38% and matching the level of MiniCPM-O-2.6. AudioRole comprises dialogues from over 115 main characters, 6 trained ARP-Models that role-play different characters, and evaluation protocols. Together, these provide an essential resource for advancing audio-grounded role-playing research.

