LLM-based agents are increasingly capable of complex task execution, yet current agentic systems remain constrained by text-centric paradigms that struggle with long-horizon tasks due to fragile multi-turn dependencies and context drift. We present CaveAgent, a framework that shifts tool use from ``LLM-as-Text-Generator'' to ``LLM-as-Runtime-Operator.'' CaveAgent introduces a dual-stream architecture that inverts the conventional paradigm: rather than treating the LLM's text context as the primary workspace with tools as auxiliaries, CaveAgent elevates the persistent Python runtime to the central locus of state, with a lightweight semantic stream serving as its orchestrator. Beyond leveraging code generation to resolve interdependent sub-tasks (e.g., loops, conditionals) in a single step, CaveAgent introduces \textit{Stateful Runtime Management}: it injects, manipulates, and retrieves complex Python objects (e.g., DataFrames, database connections) that persist across turns, unlike existing code-based approaches that remain text-bound. CaveAgent further provides a runtime-integrated skill management system that extends the Agent Skills open standard, enabling ecosystem interoperability through executable skill injection. This persistence mechanism serves as a high-fidelity external memory that reduces context drift in multi-turn interactions and preserves processed data for downstream applications without information loss. Evaluations show consistent improvements across challenging benchmarks, enabling CaveAgent to handle data scales that cause context overflow in both JSON-based and code-based agents. The accessible runtime state further provides programmatically verifiable feedback, enabling automated evaluation and reward signal generation without human annotation and establishing a structural foundation for future research in Reinforcement Learning with Verifiable Rewards (RLVR).
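The stateful-runtime idea above can be sketched in a few lines. The following is our own minimal illustration, not the paper's implementation: a single shared namespace survives across turns, so Python objects injected or created in one turn remain available to later turns without being re-serialized into the text context. The class name `PersistentRuntime` and its methods are hypothetical.

```python
class PersistentRuntime:
    """Toy sketch: code snippets from successive agent turns share one namespace."""

    def __init__(self):
        self.namespace = {}  # runtime state that persists across turns

    def inject(self, name, obj):
        """Place a Python object into the runtime for later turns to use."""
        self.namespace[name] = obj

    def execute(self, code):
        """Run a generated code snippet against the shared namespace."""
        exec(code, self.namespace)

    def retrieve(self, name):
        """Read an object back out of the runtime without text round-tripping."""
        return self.namespace[name]


runtime = PersistentRuntime()
runtime.inject("records", [{"id": 1, "score": 80}, {"id": 2, "score": 95}])

# Turn 1: filter the injected data; the result object stays in the runtime.
runtime.execute("high = [r for r in records if r['score'] > 90]")

# Turn 2: a later snippet builds directly on state produced in turn 1.
runtime.execute("ids = [r['id'] for r in high]")

print(runtime.retrieve("ids"))  # → [2]
```

Because the final state is a plain namespace of Python objects, it can also be inspected programmatically, which is the property the abstract points to for verifiable-reward generation.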