Audio large language models (AudioLLMs) enable instruction-following over speech and general audio, but progress is increasingly limited by the lack of diverse, conversational, instruction-aligned speech-text data. This bottleneck is especially acute for persona-grounded interactions and dialectal coverage, where collecting and releasing real multi-speaker recordings is costly and slow. We introduce MENASpeechBank, a reference speech bank comprising about 18K high-quality utterances from 124 speakers spanning multiple MENA countries, covering English, Modern Standard Arabic (MSA), and regional Arabic varieties. Building on this resource, we develop a controllable synthetic data pipeline that: (i) constructs persona profiles enriched with World Values Survey-inspired attributes, (ii) defines a taxonomy of about 5K conversational scenarios, (iii) matches personas to scenarios via semantic similarity, (iv) generates about 417K role-play conversations with an LLM, where the user speaks as the persona and the assistant behaves as a helpful agent, and (v) synthesizes the user turns by conditioning on reference speaker audio to preserve speaker identity and diversity. We evaluate both synthetic and human-recorded conversations and provide a detailed analysis. We will release MENASpeechBank and the generated conversations publicly for the community.
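Step (iii) of the pipeline, matching personas to scenarios via semantic similarity, can be sketched as follows. This is a minimal illustration, not the paper's implementation: a real pipeline would use a learned sentence encoder, whereas here a toy bag-of-words vector and cosine similarity stand in, and all persona/scenario names are hypothetical.

```python
# Hedged sketch of persona-to-scenario matching by semantic similarity.
# embed() is a toy stand-in for a sentence encoder (assumption: the actual
# system uses learned embeddings, not word counts).
from collections import Counter
import math

def embed(text):
    # Toy "embedding": a token-count vector over whitespace-split words.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_personas_to_scenarios(personas, scenarios, top_k=1):
    # For each persona description, rank all scenarios by similarity
    # and keep the top_k best matches.
    matches = {}
    for pid, pdesc in personas.items():
        pv = embed(pdesc)
        ranked = sorted(scenarios.items(),
                        key=lambda kv: cosine(pv, embed(kv[1])),
                        reverse=True)
        matches[pid] = [sid for sid, _ in ranked[:top_k]]
    return matches

# Hypothetical example inputs for illustration only.
personas = {"p1": "young engineer interested in renewable energy policy"}
scenarios = {"s1": "discussing renewable energy subsidies with an advisor",
             "s2": "booking a restaurant table for a family dinner"}
print(match_personas_to_scenarios(personas, scenarios))
```

In the described pipeline, each matched (persona, scenario) pair would then seed an LLM role-play conversation (step iv), whose user turns are later synthesized as speech conditioned on reference audio (step v).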