As the computational capability of mobile devices increases, deploying agentic retrieval-augmented generation (RAG) locally on heterogeneous systems-on-chip (SoCs) has become a promising way to enhance LLM-based applications. However, agentic RAG induces multi-stage workflows with heterogeneous models and dynamic execution flow, while mobile SoCs exhibit strong accelerator affinity, shape sensitivity, and shared-memory bandwidth contention, which together make naive scheduling ineffective. We present HeRo, a heterogeneous-aware framework for low-latency agentic RAG on mobile SoCs. HeRo builds profiling-based performance models for each sub-stage and each model-PU (processing unit) configuration, capturing latency, workload shape, and contention-induced slowdown. It then uses these models in a lightweight online scheduler that combines shape-aware sub-stage partitioning, criticality-based accelerator mapping, and bandwidth-aware concurrency control. Experiments on commercial mobile devices show that HeRo reduces end-to-end latency by up to $10.94\times$ over existing deployment strategies, enabling practical on-device agentic RAG.
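To make the scheduling idea concrete, the following is a minimal sketch of how criticality-based accelerator mapping and bandwidth-aware concurrency control might interact. It is not HeRo's actual algorithm: the `Profile` type, the `schedule` function, the stage names (`rerank`, `decode`, `embed`), the PU names, and all numbers are hypothetical, and the greedy policy is an assumption chosen only for illustration. Shape-aware sub-stage partitioning is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical profiled entry for one (sub-stage, PU) pair."""
    latency_ms: float       # standalone latency measured offline
    bandwidth_gbps: float   # memory-bandwidth demand measured offline


def schedule(ready, profiles, criticality, free_pus, bandwidth_budget_gbps):
    """Greedily map ready sub-stages to free PUs (illustrative only).

    Sub-stages are visited from most to least critical. Each one takes the
    free PU with the lowest profiled latency whose bandwidth demand still
    fits in the remaining budget. Sub-stages that do not fit are deferred.
    """
    assignments = {}
    remaining = bandwidth_budget_gbps
    available = set(free_pus)
    for stage in sorted(ready, key=lambda s: criticality[s], reverse=True):
        candidates = [
            (profiles[(stage, pu)].latency_ms, pu)
            for pu in available
            if (stage, pu) in profiles
            and profiles[(stage, pu)].bandwidth_gbps <= remaining
        ]
        if not candidates:
            continue  # defer: no free PU fits the bandwidth budget
        _, pu = min(candidates)
        assignments[stage] = pu
        available.remove(pu)
        remaining -= profiles[(stage, pu)].bandwidth_gbps
    return assignments


# Made-up profiling table; a real one would come from on-device measurement.
profiles = {
    ("rerank", "npu"): Profile(20.0, 12.0),
    ("rerank", "gpu"): Profile(35.0, 10.0),
    ("embed", "npu"): Profile(8.0, 6.0),
    ("embed", "gpu"): Profile(15.0, 5.0),
    ("decode", "gpu"): Profile(40.0, 18.0),
    ("decode", "cpu"): Profile(90.0, 9.0),
}
criticality = {"rerank": 3, "decode": 2, "embed": 1}

plan = schedule(
    ready=["embed", "decode", "rerank"],
    profiles=profiles,
    criticality=criticality,
    free_pus=["npu", "gpu", "cpu"],
    bandwidth_budget_gbps=25.0,
)
print(plan)
```

In this toy run, the most critical sub-stage claims its fastest accelerator first, and the bandwidth cap then pushes `decode` onto a slower PU and defers `embed`. A greedy policy like this can clearly make poor choices; it is meant only to show why accelerator affinity and shared-bandwidth contention have to be considered together rather than one at a time.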