CaveAgent: Transforming LLMs into Stateful Runtime Operators

Maohao Ran, Zhenglin Wan, Cooper Lin, Yanting Zhang, Hongyu Xin, Hongwei Fan, Yibo Xu, Beier Luo, Yaxin Zhou, Wangbo Zhao, Lijie Yang, Lang Feng, Fuchao Yang, Jingxuan Wu, Yiqiao Huang, Chendong Ma, Dailing Jiang, Jianbo Deng, Sirui Han, Yang You, Bo An, Yike Guo, Jun Song

From arXiv, version 2

LLM-based agents are increasingly capable of complex task execution, yet current agentic systems remain constrained by text-centric paradigms that struggle with long-horizon tasks because of fragile multi-turn dependencies and context drift. We present CaveAgent, a framework that shifts tool use from "LLM-as-Text-Generator" to "LLM-as-Runtime-Operator." CaveAgent introduces a dual-stream architecture that inverts the conventional paradigm: rather than treating the LLM's text context as the primary workspace with tools as auxiliaries, CaveAgent makes the persistent Python runtime the central locus of state, with a lightweight semantic stream serving as its orchestrator. Beyond using code generation to resolve interdependent sub-tasks (e.g., loops, conditionals) in a single step, CaveAgent introduces Stateful Runtime Management: it injects, manipulates, and retrieves complex Python objects (e.g., DataFrames, database connections) that persist across turns, unlike existing code-based approaches that remain text-bound. CaveAgent further provides a runtime-integrated skill management system that extends the Agent Skills open standard, enabling ecosystem interoperability through executable skill injections. This persistence mechanism serves as a high-fidelity external memory that reduces context drift in multi-turn interactions and preserves processed data for downstream applications without information loss. Evaluations show consistent improvements across challenging benchmarks, and CaveAgent handles data scales that cause context overflow in both JSON-based and code-based agents. The accessible runtime state further provides programmatically verifiable feedback, enabling automated evaluation and reward signal generation without human annotation and establishing a structural foundation for future research in Reinforcement Learning with Verifiable Rewards (RLVR).
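
The idea of a persistent runtime acting as external memory can be illustrated with a minimal sketch. The class `PersistentRuntime` and its methods `inject`, `execute`, and `retrieve` are hypothetical names chosen for this illustration; they are not CaveAgent's actual API, and the sketch assumes nothing about the framework beyond what the abstract states. It shows an object injected once, manipulated by generated code over two turns, and then retrieved as a live Python object, while only short printed summaries would return to the model's text context.

```python
import contextlib
import io


class PersistentRuntime:
    """Hypothetical stand-in for a stateful runtime; not CaveAgent's actual API."""

    def __init__(self):
        # One namespace shared by every turn, so objects survive between calls.
        self.namespace = {}

    def inject(self, name, obj):
        """Place a live Python object into the runtime under `name`."""
        self.namespace[name] = obj

    def execute(self, code):
        """Run model-generated code and return only what it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()

    def retrieve(self, name):
        """Hand the live object back to the caller without serializing it."""
        return self.namespace[name]


runtime = PersistentRuntime()
runtime.inject("orders", [{"id": i, "amount": i * 10} for i in range(1, 1001)])

# Turn 1: a loop and a conditional resolved in one step; the result stays in the runtime.
summary_1 = runtime.execute(
    "large = [o for o in orders if o['amount'] > 9000]\n"
    "print(len(large))\n"
)

# Turn 2: refers to `large` by name; the 1000 records never enter the text context.
summary_2 = runtime.execute(
    "total = sum(o['amount'] for o in large)\n"
    "print(total)\n"
)

# Programmatic check on the runtime state, of the kind an automated evaluator might run.
large = runtime.retrieve("large")
total = runtime.retrieve("total")
```

Because `large` and `total` remain ordinary Python objects in the shared namespace, a downstream program or an automated checker can inspect them directly instead of parsing model output. Note that `exec` on untrusted code is unsafe outside a sandbox; the sketch omits isolation for brevity.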
