Despite advances in text and visual generation, creating coherent long-form audio narratives remains challenging. Existing frameworks often suffer from mismatches between character settings and voice performance, insufficient self-correction mechanisms, and limited human interactivity. To address these challenges, we propose AuDirector, a self-reflective closed-loop multi-agent framework. Specifically, an Identity-Aware Pre-production mechanism transforms narrative texts into character profiles and utterance-level emotional instructions, which are used to retrieve suitable voice candidates and guide expressive speech synthesis, thereby promoting context-aligned voice adaptation. To enhance quality, a Collaborative Synthesis and Correction module introduces a closed-loop self-correction mechanism that systematically audits and regenerates defective audio components. Furthermore, a Human-Guided Interactive Refinement module gives users control by interpreting natural language feedback to interactively refine the underlying scripts. Experiments demonstrate that AuDirector outperforms state-of-the-art baselines in structural coherence, emotional expressiveness, and acoustic fidelity. Audio samples are available at https://anonymous-itsh.github.io/.
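To make the closed-loop idea concrete, the audit-and-regenerate cycle described above can be illustrated with a minimal sketch. This is not AuDirector's implementation: the `Segment` type, the `audit` and `regenerate` callables, and the `max_rounds` budget are all hypothetical names introduced here, and the sketch assumes only that each audio component can be checked independently and re-synthesized when it fails.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    """One audio component (hypothetical type, for illustration only)."""
    text: str
    audio: bytes
    attempts: int = 1  # number of synthesis attempts made so far


def correct_segments(
    segments: List[Segment],
    audit: Callable[[Segment], bool],        # assumed: returns True if the segment passes
    regenerate: Callable[[Segment], bytes],  # assumed: returns newly synthesized audio
    max_rounds: int = 3,                     # assumed retry budget
) -> Tuple[List[Segment], List[int]]:
    """Audit every segment and regenerate the defective ones.

    Repeats until all segments pass or the retry budget is exhausted.
    Returns the updated segments and the indices that still fail.
    """
    segments = list(segments)  # work on a copy of the input list
    for _ in range(max_rounds):
        defective = [i for i, seg in enumerate(segments) if not audit(seg)]
        if not defective:
            break
        for i in defective:
            old = segments[i]
            segments[i] = Segment(old.text, regenerate(old), old.attempts + 1)
    unresolved = [i for i, seg in enumerate(segments) if not audit(seg)]
    return segments, unresolved
```

Only the failing components are re-synthesized in each round, so segments that already pass the audit are left untouched. Indices that still fail after the budget is spent are returned rather than silently accepted, which would let a later stage (for example, human feedback) handle them.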