Power has become a central bottleneck for AI inference. This problem is becoming more urgent as agentic AI emerges as a major workload class, yet prior power-management techniques focus almost entirely on single-turn LLM serving. Our analysis shows that agentic serving behaves fundamentally differently: each request carries long-lived context that evolves across tool-interleaved turns, and lowering GPU frequency can push the system into a thrashing regime where memory pressure sharply worsens both performance and power efficiency. These observations show that power optimization for agentic serving requires rethinking. We present KAIROS, a context-aware power optimization system for agentic AI serving. KAIROS uses agent context as a first-class control signal to jointly manage GPU frequency, per-instance concurrency, and multi-instance request placement. This enables KAIROS to save power when memory headroom exists while avoiding thrashing and preserving performance targets. At a high level, KAIROS tracks requests at agent granularity, adapts local control to context growth and agent progress, and routes agents across instances to jointly improve power efficiency and memory stability. Evaluated across diverse software and data engineering agentic tasks, KAIROS achieves an average of 27% (up to 39.8%) power reduction while meeting the performance targets.
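The control idea described above, lowering GPU frequency only while memory headroom exists and placing agents where context growth is least likely to cause thrashing, can be illustrated with a minimal sketch. This is not the KAIROS implementation: the class names, the per-turn growth estimate, the frequency levels, and the headroom thresholds are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Agent:
    agent_id: str
    context_tokens: int    # context currently held in the KV cache
    growth_per_turn: int   # assumed tokens added by the next tool-interleaved turn


@dataclass
class Instance:
    name: str
    kv_capacity_tokens: int            # hypothetical KV-cache budget of this instance
    agents: List[Agent] = field(default_factory=list)

    def projected_tokens(self) -> int:
        # Context after each resident agent completes one more turn.
        return sum(a.context_tokens + a.growth_per_turn for a in self.agents)

    def headroom(self) -> float:
        # Fraction of KV capacity still free after the projected growth.
        return 1.0 - self.projected_tokens() / self.kv_capacity_tokens


def pick_frequency_mhz(inst: Instance,
                       freqs: Sequence[int] = (1000, 1400, 1800),
                       safe_headroom: float = 0.3) -> int:
    """Save power only when memory headroom exists (thresholds are assumptions)."""
    levels = sorted(freqs)
    h = inst.headroom()
    if h >= safe_headroom:
        return levels[0]                   # ample headroom: lowest frequency
    if h >= safe_headroom / 2:
        return levels[len(levels) // 2]    # tightening: middle frequency
    return levels[-1]                      # near thrashing: run at full speed


def route(agent: Agent, instances: List[Instance]) -> Instance:
    """Place an agent on the instance with the most projected headroom."""
    target = max(instances, key=lambda i: i.headroom())
    target.agents.append(agent)
    return target


if __name__ == "__main__":
    a = Instance("gpu-a", 100_000)
    b = Instance("gpu-b", 100_000)
    for ag in (Agent("x", 60_000, 5_000), Agent("y", 20_000, 5_000)):
        placed = route(ag, [a, b])
        print(ag.agent_id, "->", placed.name, pick_frequency_mhz(placed), "MHz")
```

The sketch covers only frequency selection and placement. Per-instance concurrency control and performance-target tracking, which the abstract also names, are omitted.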