Large language models (LLMs) exhibit impressive reasoning and problem-solving abilities, yet their substantial inference latency and token consumption pose major challenges for real-time deployment on resource-limited edge devices. Recent efforts toward edge-cloud collaboration attempt to mitigate this issue, but most existing methods adopt coarse-grained task allocation strategies that assign entire queries either to the edge or to the cloud. Such rigid partitioning fails to exploit fine-grained reasoning parallelism and often leads to redundant computation and inefficient resource utilization. To address this, we propose HybridFlow, a resource-adaptive inference framework that enables fast and token-efficient collaborative reasoning between edge and cloud LLMs. HybridFlow operates in two stages: (1) task decomposition and parallel execution, which dynamically splits a complex query into interdependent subtasks that can execute as soon as their dependencies are resolved; and (2) resource-aware subtask routing, where a learned router adaptively assigns each subtask to the edge or cloud model according to predicted utility gains and real-time budget states. Comprehensive evaluations on GPQA, MMLU-Pro, AIME, and LiveBench-Reasoning demonstrate that HybridFlow effectively reduces end-to-end inference time and overall token usage while maintaining competitive accuracy.
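The two stages described above can be pictured with a minimal sketch: a dependency-driven scheduler that launches each subtask as soon as its predecessors finish, and a budget-aware router that decides between the edge and cloud model per subtask. This is purely illustrative, not the paper's implementation: the names Subtask, Budget, route_subtask, and the stub model calls are hypothetical, and the toy utility heuristic merely stands in for the learned router.

```python
# Illustrative sketch of dependency-driven parallel subtask execution
# (stage 1) plus budget-aware edge/cloud routing (stage 2). All names
# and the routing heuristic are assumptions, not HybridFlow's API.
import asyncio
from dataclasses import dataclass, field


@dataclass
class Subtask:
    name: str
    prompt: str
    deps: list[str] = field(default_factory=list)  # names of prerequisite subtasks


@dataclass
class Budget:
    cloud_tokens_left: int  # real-time budget state consulted by the router


async def edge_llm(prompt: str) -> str:
    # Stand-in for the on-device model; replace with a real client.
    await asyncio.sleep(0.01)
    return f"[edge] {prompt[:30]}"


async def cloud_llm(prompt: str) -> str:
    # Stand-in for the cloud model; replace with a real client.
    await asyncio.sleep(0.05)
    return f"[cloud] {prompt[:30]}"


def route_subtask(task: Subtask, budget: Budget) -> str:
    # Toy utility predictor standing in for the learned router: assume
    # longer, multi-dependency subtasks gain more from the cloud, and
    # fall back to the edge once the cloud token budget runs out.
    predicted_gain = len(task.prompt) / 100 + 0.5 * len(task.deps)
    estimated_cost = len(task.prompt)  # crude token estimate
    if predicted_gain > 1.0 and budget.cloud_tokens_left >= estimated_cost:
        budget.cloud_tokens_left -= estimated_cost
        return "cloud"
    return "edge"


async def run_dag(subtasks: list[Subtask], budget: Budget) -> dict[str, str]:
    results: dict[str, str] = {}
    done = {t.name: asyncio.Event() for t in subtasks}

    async def run_one(task: Subtask) -> None:
        # A subtask starts as soon as all of its dependencies resolve,
        # so independent branches of the DAG execute in parallel.
        for dep in task.deps:
            await done[dep].wait()
        context = " ".join(results[d] for d in task.deps)
        model = cloud_llm if route_subtask(task, budget) == "cloud" else edge_llm
        results[task.name] = await model(f"{context} {task.prompt}".strip())
        done[task.name].set()

    await asyncio.gather(*(run_one(t) for t in subtasks))
    return results


if __name__ == "__main__":
    dag = [
        Subtask("a", "extract the key quantities from the question"),
        Subtask("b", "recall the relevant formula"),
        Subtask("c", "combine both partial results into an answer", deps=["a", "b"]),
    ]
    print(asyncio.run(run_dag(dag, Budget(cloud_tokens_left=200))))
```

In this sketch, subtasks "a" and "b" run concurrently while "c" waits on both, mirroring the dependency-resolved parallelism the abstract describes; a production scheduler would additionally need cycle detection, failure handling, and a genuinely learned utility model.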