Demand is growing for agentic AI in downstream applications such as customer service and personal assistants. When an agent interacts with a person, it must respond in real time with low latency; voice-controlled applications, for example, typically need under 1 second of latency for the interaction to feel seamless. However, having the LLM reason and execute an agentic workflow with tool calling can add several seconds or more of latency, which is prohibitive for latency-sensitive real-time applications. We propose Speculative Interaction Agents to enable real-time interaction even for agents with complex multi-turn tool calling. The first component, Asynchronous I/O, decouples the core agent reason-and-act thread from waiting for additional information from the user or the environment, so that agentic processing can overlap with external delays. The second, Speculative Tool Calling, manages task execution when the agent is still unsure whether it has received the full information or whether the user may provide more later. For strong cloud models, our method can be applied out-of-the-box to existing real-time cloud APIs, providing 1.3-1.7$\times$ speedups with minor accuracy loss. To enable real-time interaction with small edge-scale models, we also present a clock-based training methodology that adapts the model to handle streaming inputs and asynchronous responses, and we demonstrate a synthetic data generation strategy for supervised fine-tuning (SFT). Altogether, this approach provides 1.6-2.2$\times$ speedups with the Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct models across multiple tool calling benchmarks.
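The interplay of asynchronous I/O and speculative tool calling can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy setting in which the user's utterance arrives in chunks on a queue, and every name in it (`slow_tool`, `user_stream`, `agent`, `run_demo`, the `None` end-of-turn marker, and the delay values) is hypothetical. The agent launches a tool call on whatever it has heard so far, and discards that speculation if more input arrives.

```python
import asyncio


async def slow_tool(query: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for a tool call with noticeable latency."""
    await asyncio.sleep(delay)
    return f"result for: {query}"


async def user_stream(queue: asyncio.Queue, chunks, gap: float) -> None:
    """Simulate a user whose utterance arrives in pieces."""
    for chunk in chunks:
        await queue.put(chunk)
        await asyncio.sleep(gap)
    await queue.put(None)  # assumed end-of-turn marker


async def agent(queue: asyncio.Queue):
    """Speculate on partial input; restart whenever more input arrives."""
    heard = []
    speculative = None  # the in-flight speculative tool call, if any
    discarded = 0
    while True:
        chunk = await queue.get()  # waiting here does not block the tool task
        if chunk is None:
            break
        heard.append(chunk)
        if speculative is not None:
            # New information makes the earlier speculation stale.
            speculative.cancel()  # harmless if the task already finished
            discarded += 1
        speculative = asyncio.create_task(slow_tool(" ".join(heard)))
    if speculative is None:
        return "", discarded
    # The last speculation saw the full input, so its result is committed.
    return await speculative, discarded


async def run_demo(chunks, gap: float = 0.01):
    queue = asyncio.Queue()
    producer = asyncio.create_task(user_stream(queue, chunks, gap))
    outcome = await agent(queue)
    await producer
    return outcome


result, discarded = asyncio.run(run_demo(["book a table", "for two", "at 7pm"]))
print(result, discarded)
```

In this sketch the final tool call starts as soon as the last chunk arrives instead of after the end-of-turn marker, so its latency overlaps with the wait for that marker. A real system would also need to decide when a speculative call is safe to run, for example by restricting speculation to tools without side effects; the sketch does not address that.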