Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool use to interact with the world. However, current multi-modal LLMs are largely confined to bi-modal interactions (e.g., vision-language) and lack the unified cognitive capabilities required of general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks that demand deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent that follows a tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized through a hindsight-guided tree exploration strategy and refined with OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step toward next-generation native omni-modal AI assistants for real-world scenarios.