Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC

Personal LLM agents increasingly combine foreground reactive interactions with background proactive monitoring, forming long-lived, stateful LLM flows that interleave prefill and token-by-token decode. While modern heterogeneous SoCs integrate CPUs, iGPUs, and NPUs to support on-device intelligence, existing LLM engines assume static, single-shot inference and lack mechanisms for flow-level concurrency, prioritization, and efficient accelerator coordination. As a result, commodity SoCs remain poorly matched to the dynamic, mixed-criticality execution patterns of personal agents. This paper presents Agent.xpu, the first LLM engine that orchestrates concurrent reactive and proactive LLM flows on commodity SoCs. Extensive profiling uncovers SoC-specific characteristics (operator-accelerator affinity, asymmetric DDR contention, and stage-divergent batching behavior) that differ from cloud-serving assumptions. Agent.xpu introduces three key techniques: a heterogeneous execution graph (HEG) that captures NPU/iGPU affinity and supports elastic operator binding; flow-aware NPU-iGPU coordination with stage elasticity, which decouples prefill and decode to reduce bandwidth contention and enforce priorities; and fine-grained preemption with slack-aware piggybacking, which guarantees reactive responsiveness without starving proactive work. Across realistic personal-agent workloads, Agent.xpu delivers 1.2-4.9× higher proactive throughput and reduces reactive latency by at least 91%, compared with both an industrial iGPU-only serving engine and static NPU-iGPU inference with optimal tensor-partitioning schemes. Agent.xpu also minimizes energy consumption and graphics interference via controlled iGPU usage.
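To make the idea of fine-grained preemption with piggybacking more concrete, the following is a minimal toy sketch, not the Agent.xpu implementation. It assumes a single accelerator that runs one batch per discrete "kernel slot", treats each flow as a count of remaining kernels, and defines slack simply as unused batch capacity after reactive flows are placed. The names `Flow`, `schedule`, `kernels_left`, and `max_batch` are hypothetical and do not come from the paper.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Flow:
    name: str
    reactive: bool     # True: foreground request; False: background task
    kernels_left: int  # remaining kernel-sized work units (hypothetical unit)


def schedule(arrivals, max_batch=2):
    """Simulate one accelerator in discrete kernel slots.

    arrivals maps a slot index to the list of flows arriving at that slot.
    Returns, for each slot, the tuple of flow names that ran in it.
    """
    reactive, proactive = deque(), deque()
    pending = sum(len(flows) for flows in arrivals.values())
    timeline = []
    slot = 0
    while pending or reactive or proactive:
        # Admit flows that arrive at this kernel boundary.
        for flow in arrivals.get(slot, []):
            (reactive if flow.reactive else proactive).append(flow)
            pending -= 1

        # Preemption: reactive flows claim batch capacity first, so a
        # running proactive flow loses the accelerator at the next boundary.
        batch = list(reactive)[:max_batch]

        # Piggybacking (assumed definition): leftover batch capacity is
        # the slack, and proactive flows fill it instead of waiting.
        slack = max_batch - len(batch)
        batch += list(proactive)[:slack]

        for flow in batch:
            flow.kernels_left -= 1
        timeline.append(tuple(flow.name for flow in batch))

        # Retire finished flows while preserving queue order.
        reactive = deque(f for f in reactive if f.kernels_left > 0)
        proactive = deque(f for f in proactive if f.kernels_left > 0)
        slot += 1
    return timeline
```

For example, with a background flow of three kernels arriving at slot 0 and a foreground flow of two kernels arriving at slot 1, `max_batch=1` makes the foreground flow take over slots 1 and 2 before the background flow resumes, while `max_batch=2` lets the background flow ride along in the spare capacity and finish at the same time. A real engine would derive slack from latency targets and memory-bandwidth contention rather than from a fixed batch size.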
