Agentic multimodal large language models (MLLMs), such as OpenAI o3 and Gemini Agentic Vision, achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded loops of perception, reasoning, and tool calling introduce significant sequential overhead. This overhead, which we term agentic depth, incurs prohibitive latency and severely limits system-level concurrency. To address this, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner that predicts the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves a 1.1x to 3.35x speedup over the agentic baseline while preserving or even improving accuracy (by up to +6.7%), thereby boosting serving throughput under concurrent workloads.
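The gating idea can be illustrated with a minimal sketch. The abstract does not define answer separability, so the score used here (the gap between the top-1 and top-2 answer probabilities of the small model) is an assumption made purely for illustration, as are the names `separability`, `speculative_answer`, `run_agentic`, and the threshold value; none of them are taken from SpecEyes itself.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def separability(logits: Sequence[float]) -> float:
    """Hypothetical separability score: top-1 minus top-2 probability.

    This is an illustrative stand-in, not the paper's definition.
    """
    probs = sorted(softmax(logits), reverse=True)
    return probs[0] - probs[1]


def speculative_answer(
    small_logits: Sequence[float],
    candidates: Sequence[str],
    run_agentic: Callable[[], str],
    threshold: float = 0.5,  # assumed value, would need tuning in practice
) -> Tuple[str, str]:
    """Accept the small model's answer if it is well separated,
    otherwise fall back to the expensive tool-using agent."""
    if len(candidates) != len(small_logits) or len(candidates) < 2:
        raise ValueError("need one logit per candidate and at least two candidates")
    if separability(small_logits) >= threshold:
        best = max(range(len(candidates)), key=lambda i: small_logits[i])
        return candidates[best], "speculative"
    # Low separability: the tool chain runs only in this branch.
    return run_agentic(), "agentic"
```

In this sketch, the tool-using model is invoked only when the small model's answer distribution is ambiguous, which is the sense in which confident queries can terminate early. No oracle labels are needed because the score depends only on the small model's own outputs.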