Edge-cloud collaborative inference is becoming a practical necessity for LLM-powered edge devices: on-device models often lack the required reasoning capability, while cloud-only inference can be prohibitively costly and slow under strict latency and token/API budgets. However, existing edge-cloud collaboration methods typically route entire queries or fixed reasoning steps based solely on estimated difficulty. Such coarse, static heuristics overlook subtask dependencies, missing opportunities for parallel execution and budget-adaptive routing. To this end, we propose \textbf{HybridFlow}, a resource-adaptive edge-cloud inference framework that (i) builds a dependency-aware DAG for each query and executes newly unlocked subtasks in parallel, reducing end-to-end latency; and (ii) routes each subtask online to the edge or the cloud via a learned benefit--cost utility model that dynamically trades accuracy gains against token/API and latency budgets, thereby reducing unnecessary cloud usage while preserving reasoning quality. Across GPQA, MMLU-Pro, AIME24, and LiveBench-Reasoning, HybridFlow improves the cost--accuracy trade-off, reducing latency and cloud API usage while maintaining competitive accuracy against strong structured-reasoning baselines.
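A minimal formal sketch of the per-subtask routing rule in (ii), under assumed notation (none of these symbols is defined in the source and they are purely illustrative): for a subtask $s$, route to the cloud iff the expected utility
\[
U(s) \;=\; \Delta\mathrm{Acc}(s) \;-\; \lambda_{\mathrm{tok}}\, C_{\mathrm{tok}}(s) \;-\; \lambda_{\mathrm{lat}}\, C_{\mathrm{lat}}(s)
\]
is positive, where $\Delta\mathrm{Acc}(s)$ estimates the accuracy gain of cloud over edge execution, $C_{\mathrm{tok}}(s)$ and $C_{\mathrm{lat}}(s)$ denote the token/API and latency costs, and $\lambda_{\mathrm{tok}}, \lambda_{\mathrm{lat}}$ are budget multipliers that grow as the remaining token/API or latency budget shrinks, making the router increasingly conservative about cloud calls.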