AI agents are increasingly deployed in multi-tenant cloud environments, where they execute diverse tool calls within sandboxed containers, each with distinct and rapidly fluctuating resource demands. We present a systematic characterization of OS-level resource dynamics in sandboxed AI coding agents, analyzing 144 software engineering tasks from the SWE-rebench benchmark across two LLMs. Our measurements reveal that (1) OS-level execution (tool calls, container and agent initialization) accounts for 56-74% of end-to-end task latency; (2) memory, not CPU, is the concurrency bottleneck; (3) memory spikes are tool-call-driven, with peak-to-average ratios of up to 15.4x; and (4) resource demands are highly unpredictable across tasks, runs, and models. Comparing these characteristics against serverless, microservice, and batch workloads, we identify three mismatches in existing resource controls: a granularity mismatch (container-level policies vs. tool-call-level dynamics), a responsiveness mismatch (user-space reaction vs. sub-second unpredictable bursts), and an adaptability mismatch (history-based prediction vs. non-deterministic stateful execution). We propose AgentCgroup, an eBPF-based resource controller that addresses these mismatches through hierarchical cgroup structures aligned with tool-call boundaries, in-kernel enforcement via sched_ext and memcg_bpf_ops, and runtime-adaptive policies driven by in-kernel monitoring. Preliminary evaluation demonstrates improved multi-tenant isolation and reduced resource waste.
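To make the peak-to-average ratio in finding (3) concrete, the following minimal sketch computes it for a memory trace. It assumes the ratio is the maximum sample divided by the arithmetic mean of all samples; the abstract does not state the exact definition or sampling interval used in the measurements, and the function name `peak_to_average` and the trace values are hypothetical, chosen only for illustration.

```python
from statistics import mean


def peak_to_average(samples_mb):
    """Return max(samples) / mean(samples) for a non-empty memory trace.

    Assumption: "peak-to-average ratio" means peak sample over the
    arithmetic mean of the trace; the paper may define it differently.
    """
    if not samples_mb:
        raise ValueError("trace must contain at least one sample")
    avg = mean(samples_mb)
    if avg <= 0:
        raise ValueError("average memory must be positive")
    return max(samples_mb) / avg


# Hypothetical trace (MB): mostly idle, with one short burst during a tool call.
trace = [100, 100, 100, 100, 600]
ratio = peak_to_average(trace)
print(f"peak-to-average ratio: {ratio:.1f}x")
```

A limit sized for the peak of such a trace leaves most of the allocation idle on average, while a limit sized near the average is exceeded during the burst, which is one way to read the memory-bottleneck and resource-waste observations above.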