LLM-based agents are increasingly capable of executing complex tasks, yet current agentic systems remain constrained by text-centric paradigms. Traditional approaches rely on procedural JSON-based function calling, which often struggles with long-horizon tasks due to fragile multi-turn dependencies and context drift. In this paper, we present CaveAgent, a framework that transforms the paradigm from ``LLM-as-Text-Generator'' to ``LLM-as-Runtime-Operator.'' We introduce a Dual-stream Context Architecture that decouples state management into a lightweight semantic stream for reasoning and a persistent, deterministic Python Runtime stream for execution. In addition to leveraging code generation to resolve interdependent sub-tasks (e.g., loops, conditionals) efficiently in a single step, we introduce \textit{Stateful Runtime Management} in CaveAgent. Unlike existing code-based approaches, which remain text-bound and lack support for external object injection and retrieval, CaveAgent injects, manipulates, and retrieves complex Python objects (e.g., DataFrames, database connections) that persist across turns. This persistence mechanism acts as a high-fidelity external memory that eliminates context drift and catastrophic forgetting while ensuring that processed data flows losslessly to downstream applications. Comprehensive evaluations on $\tau^2$-bench, BFCL, and various case studies across representative SOTA LLMs demonstrate CaveAgent's superiority. Specifically, our framework achieves a 10.5\% improvement in success rate on retail tasks and reduces total token consumption by 28.4\% in multi-turn scenarios. On data-intensive tasks, direct variable storage and retrieval reduces token consumption by 59\%, allowing CaveAgent to handle large-scale data that causes context-overflow failures in both JSON-based and code-based agents.
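To make the stateful-runtime idea concrete, the following Python snippet is a minimal sketch (the class and method names are hypothetical illustrations, not CaveAgent's actual interface) of a persistent namespace into which external objects are injected, against which generated code is executed across turns, and from which processed objects are retrieved losslessly:

\begin{verbatim}
# Minimal sketch (hypothetical API) of a stateful Python runtime:
# objects injected once persist across turns, so later generated code
# can reference them by name instead of round-tripping data as text.
import pandas as pd

class StatefulRuntime:
    """Hypothetical persistent namespace shared by all turns of a session."""

    def __init__(self):
        self.namespace = {"pd": pd}  # variables live here across turns

    def inject(self, name, obj):
        """Place an external object (e.g., a DataFrame) into the runtime."""
        self.namespace[name] = obj

    def run(self, code):
        """Execute generated code against the persistent namespace."""
        exec(code, self.namespace)

    def retrieve(self, name):
        """Hand a processed object losslessly to a downstream application."""
        return self.namespace[name]

runtime = StatefulRuntime()
runtime.inject("orders", pd.DataFrame({"sku": ["A", "B"], "qty": [3, 5]}))
# Turn 1: filter the data; the result persists inside the runtime.
runtime.run("large_orders = orders[orders.qty > 4]")
# Turn 2: a later step reuses `large_orders` without re-serializing it.
runtime.run("total_qty = int(large_orders.qty.sum())")
print(runtime.retrieve("total_qty"))  # -> 5
\end{verbatim}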