Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet extending them to jointly produce speech and 3D facial animation remains largely unexplored despite its importance for natural human-computer interaction. A key challenge is the mismatch between the discrete semantic reasoning of LLMs and the dense temporal dynamics required for 3D facial motion. We propose Expressive Omni (Ex-Omni), an open-source model that augments OLLMs with native speech-accompanied 3D facial animation. Ex-Omni decouples semantic reasoning from temporal generation through a blendshape-aware speech unit generator and a blendshape decoder, where speech units provide temporal scaffolding and hidden speech representations carry facially relevant cues. We further introduce a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection, as well as InstructS2SF-1200K, a dataset consisting of 1200K samples for pre-training. Extensive experiments show that Ex-Omni maintains competitive speech understanding and generation ability while achieving better audio-visual synchronization and lower face-generation latency than cascaded pipelines.
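The abstract does not spell out how TQGF is implemented. As a rough illustration only, the general idea of gated fusion with tokens acting as queries can be sketched as below. The sketch assumes scaled dot-product attention over a set of semantic context vectors and a scalar sigmoid gate per token. The function names, the gate parameterization (`gate_w`, `gate_b`), and the residual form are all hypothetical choices made for this sketch, not the Ex-Omni design.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(tokens, context, gate_w, gate_b=0.0):
    """Fuse semantic context into each token through a learned-style gate.

    tokens  : list of d-dim lists; each token serves as an attention query.
    context : list of d-dim lists; used as both keys and values (assumption).
    gate_w  : 2d-dim weight vector for the scalar gate (hypothetical parameter).
    gate_b  : scalar gate bias (hypothetical parameter).
    """
    d = len(tokens[0])
    fused = []
    for q in tokens:
        # Token-as-query: score every context vector against this token.
        scores = [dot(q, k) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        # Attention-weighted summary of the semantic context.
        attended = [sum(w * v[i] for w, v in zip(weights, context))
                    for i in range(d)]
        # Scalar gate computed from the token and its attended context.
        g = sigmoid(dot(gate_w, q + attended) + gate_b)
        # Gated residual injection: g near 0 leaves the token unchanged.
        fused.append([qi + g * ai for qi, ai in zip(q, attended)])
    return fused


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    context = [[0.5, 0.5], [1.0, -1.0]]
    print(gated_fusion(tokens, context, gate_w=[0.1, 0.1, 0.1, 0.1]))
```

In this sketch the gate controls how much semantic information is injected: a strongly negative gate pre-activation leaves the token nearly untouched, while a strongly positive one adds the full attended context.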