Edge-cloud collaborative inference is becoming a practical necessity for LLM-powered edge devices: on-device models often lack the reasoning capability that complex queries require, while cloud-only inference can be prohibitively costly and slow under strict latency and token/API budgets. However, existing edge-cloud collaboration methods typically route whole queries or fixed reasoning steps based solely on estimated difficulty. Such coarse, static heuristics overlook subtask dependencies, missing opportunities for parallel execution and budget-adaptive routing. To address this, we propose \textbf{HybridFlow}, a resource-adaptive edge-cloud inference framework that (i) builds a dependency-aware DAG for each query and executes newly unlocked subtasks in parallel, reducing end-to-end latency; and (ii) routes each subtask online to the edge or the cloud via a learned benefit--cost utility model that dynamically trades accuracy gains against token/API and latency budgets, reducing unnecessary cloud usage while preserving reasoning quality. Across GPQA, MMLU-Pro, AIME24, and LiveBench-Reasoning, HybridFlow improves the cost-accuracy trade-off, reducing latency and cloud API usage while maintaining accuracy competitive with strong structured-reasoning baselines.
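To make the two components concrete, the following is a minimal sketch of dependency-aware wave scheduling with benefit-cost routing. All names (`Subtask`, `cloud_gain`, `cloud_cost`, the trade-off weight `lam`) are illustrative assumptions, not HybridFlow's actual interfaces, and the linear utility rule stands in for the paper's learned utility model.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str
    deps: list          # names of subtasks that must finish first
    cloud_gain: float   # assumed: estimated accuracy gain from cloud execution
    cloud_cost: float   # assumed: normalized token/API + latency cost of the cloud


def route(task: Subtask, lam: float) -> str:
    # Benefit-cost utility (hypothetical linear form): send a subtask to the
    # cloud only when its expected accuracy gain outweighs the weighted cost.
    return "cloud" if task.cloud_gain - lam * task.cloud_cost > 0 else "edge"


def schedule(tasks: list, lam: float):
    # Kahn-style traversal of the subtask DAG: each wave collects the subtasks
    # whose dependencies are all done, so everything in a wave can run in
    # parallel across edge and cloud.
    done, waves, routing = set(), [], {}
    remaining = {t.name: t for t in tasks}
    while remaining:
        ready = [t for t in remaining.values() if set(t.deps) <= done]
        if not ready:
            raise ValueError("cycle in subtask DAG")
        waves.append(sorted(t.name for t in ready))
        for t in ready:
            routing[t.name] = route(t, lam)
            done.add(t.name)
            del remaining[t.name]
    return waves, routing


# Toy query decomposed into three subtasks: C depends on both A and B,
# so A and B form one parallel wave and C runs in the next.
tasks = [
    Subtask("A", [], cloud_gain=0.1, cloud_cost=1.0),
    Subtask("B", [], cloud_gain=0.5, cloud_cost=1.0),
    Subtask("C", ["A", "B"], cloud_gain=0.3, cloud_cost=0.5),
]
waves, routing = schedule(tasks, lam=0.4)
```

With `lam=0.4`, the easy subtask A stays on the edge (utility 0.1 − 0.4 < 0), while B and C go to the cloud; raising `lam` tightens the budget and pushes more subtasks to the edge, which is the knob behind the cost-accuracy trade-off described above.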