Autonomous agents act through sandboxed containers and microVMs whose state spans filesystems, processes, and runtime artifacts. Checkpoint and restore (C/R) of this state is needed for fault tolerance, spot execution, RL rollout branching, and safe rollback, yet existing approaches fall into two extremes: application-level recovery preserves chat history but misses OS-side effects, while full per-turn checkpointing is correct but too expensive under dense co-location. The root cause is an agent-OS semantic gap: agent frameworks see tool calls but not their OS effects, while the OS sees state changes but lacks the turn-level context to judge recovery relevance. This gap hides massive sparsity: over 75% of agent turns produce no recovery-relevant state, so most checkpoints are unnecessary. Crab (Checkpoint-and-Restore for Agent SandBoxes) is a transparent host-side runtime that bridges this gap without modifying agents or C/R backends. An eBPF-based inspector classifies each turn's OS-visible effects to decide checkpoint granularity; a coordinator aligns checkpoints with turn boundaries and overlaps C/R with LLM wait time; and a host-scoped engine schedules checkpoint traffic across co-located sandboxes. On shell-intensive and code-repair workloads, Crab raises recovery correctness from 8% (chat-only) to 100%, cuts checkpoint traffic by up to 87%, and stays within 1.9% of fault-free execution time.
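To make the idea of classifying each turn's OS-visible effects concrete, the following is a minimal sketch and not Crab's actual implementation. The three granularity levels, the `TurnEffects` fields, and the decision rule are all assumptions introduced for illustration; the abstract states only that an inspector classifies per-turn effects to decide checkpoint granularity.

```python
from dataclasses import dataclass
from enum import Enum


class Granularity(Enum):
    """Hypothetical checkpoint levels; the abstract does not enumerate them."""
    SKIP = "skip"        # nothing recovery-relevant changed this turn
    FILESYSTEM = "fs"    # only on-disk state changed
    FULL = "full"        # process/runtime state must be captured as well


@dataclass(frozen=True)
class TurnEffects:
    """Assumed per-turn summary of OS-visible effects (e.g. gathered by tracing)."""
    files_written: int = 0
    processes_alive_after_turn: int = 0  # e.g. a background server left running


def choose_granularity(effects: TurnEffects) -> Granularity:
    """Pick the cheapest checkpoint that still covers what the turn changed."""
    if effects.processes_alive_after_turn > 0:
        return Granularity.FULL
    if effects.files_written > 0:
        return Granularity.FILESYSTEM
    return Granularity.SKIP


# Illustrative turns only; these are not measurements from the paper.
turns = [
    TurnEffects(),                                              # read-only command
    TurnEffects(files_written=3),                               # edits source files
    TurnEffects(),                                              # pure reasoning turn
    TurnEffects(files_written=1, processes_alive_after_turn=1), # starts a daemon
]
plan = [choose_granularity(t) for t in turns]
skipped = sum(1 for g in plan if g is Granularity.SKIP)
```

In this toy trace, two of the four turns need no checkpoint at all. That is the kind of sparsity the abstract refers to, although the ratio here is invented rather than taken from the reported results.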