Developing expressive and responsive conversational digital humans is a cornerstone of next-generation human-computer interaction. While large language models (LLMs) have significantly enhanced dialogue capabilities, most current systems still rely on cascaded architectures that chain independent modules. Such pipelines are prone to error accumulation, high latency, and poor real-time performance; lacking access to the underlying conversational context, they inherently prioritize rigid lip-sync over emotional depth. To address these challenges, we propose A$^2$-LLM, an end-to-end conversational audio avatar large language model that jointly reasons about language, audio prosody, and 3D facial motion within a unified framework. To facilitate training, we introduce FLAME-QA, a high-quality multimodal dataset designed to align semantic intent with expressive facial dynamics in a question-answering (QA) format. By leveraging deep semantic understanding, A$^2$-LLM generates emotionally rich facial movements that go beyond simple lip-synchronization. Experimental results demonstrate that our system achieves superior emotional expressiveness while maintaining real-time efficiency (500 ms latency, 0.7 real-time factor).
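The abstract describes FLAME-QA as pairing QA-style dialogue with expressive facial dynamics. Below is a minimal sketch of how one such sample could be laid out; all field names, dimensions, and rates (100 FLAME expression coefficients plus a 3-D jaw pose per frame, 16 kHz audio) are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical layout of a FLAME-QA style sample (illustration only).
# Field names and shapes are assumed, not taken from the paper.
from dataclasses import dataclass
import numpy as np

@dataclass
class FlameQASample:
    question_text: str          # user query providing semantic context
    answer_text: str            # target reply the avatar speaks
    answer_audio: np.ndarray    # waveform of the spoken reply, e.g. 16 kHz mono
    expression: np.ndarray      # (T, 100) FLAME expression coefficients per frame
    jaw_pose: np.ndarray        # (T, 3) jaw rotation per frame, drives lip motion

def make_dummy_sample(num_frames: int = 50) -> FlameQASample:
    """Build a placeholder sample with the assumed shapes."""
    return FlameQASample(
        question_text="How was your day?",
        answer_text="It was wonderful, thank you for asking!",
        answer_audio=np.zeros(16_000 * 2, dtype=np.float32),  # 2 s of silence
        expression=np.zeros((num_frames, 100), dtype=np.float32),
        jaw_pose=np.zeros((num_frames, 3), dtype=np.float32),
    )

if __name__ == "__main__":
    sample = make_dummy_sample()
    print(sample.expression.shape, sample.jaw_pose.shape)  # (50, 100) (50, 3)
```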