In RL post-training of LLM agents, calls to external tools can take seconds or even minutes, leaving allocated GPUs idle and inflating post-training time and cost. While many tool invocations repeat across parallel rollouts and could in principle be cached, naively caching their outputs for reuse is incorrect, since tool outputs depend on the environment state induced by prior agent interactions. We present TVCACHE, a stateful tool-value cache for LLM agent post-training. TVCACHE maintains a tree of observed tool-call sequences and performs longest-prefix matching for cache lookups: a hit occurs only when the agent's full tool history matches a previously executed sequence, guaranteeing identical environment state. On three diverse workloads (terminal-based tasks, SQL generation, and video understanding), TVCACHE achieves cache hit rates of up to 70% and reduces median tool-call execution time by up to 6.9×, with no degradation in post-training reward accumulation.
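The history-keyed lookup described above can be sketched as a trie over tool-call sequences, where a node stores the cached output of the call that led to it. This is a minimal illustrative sketch, not the paper's implementation; the class and method names (`ToolCallCache`, `record`, `lookup`) are assumptions.

```python
class ToolCallCache:
    """Trie over tool-call sequences.

    A lookup hits only when the agent's entire tool history matches a
    previously executed sequence, so the cached output corresponds to
    an identical environment state. Calls are assumed to be hashable,
    e.g. (tool_name, args) tuples.
    """

    def __init__(self):
        self.children = {}   # call -> ToolCallCache (child node)
        self.output = None   # cached tool output for the path to this node

    def record(self, history, output):
        """Store `output` for the full sequence `history` of tool calls."""
        node = self
        for call in history:
            node = node.children.setdefault(call, ToolCallCache())
        node.output = output

    def lookup(self, history):
        """Return the cached output for `history`, or None on a miss.

        Any mismatch along the path means the environment state may
        differ, so the lookup falls through to real tool execution.
        """
        node = self
        for call in history:
            node = node.children.get(call)
            if node is None:
                return None
        return node.output
```

For example, after recording the sequence `(("ls", "-la"),)` with its output, a rollout whose history is exactly that sequence hits the cache, while a rollout that took a different first action misses and must execute the tool for real.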