Deploying large language model (LLM) inference at the edge improves service responsiveness while protecting user privacy, but it is critically challenged by the resource constraints of a single edge node. Distributed inference has emerged to aggregate and leverage computational resources across multiple devices. Yet existing methods typically require strict synchronization, which is often infeasible under unreliable network conditions. In this paper, we propose HALO, a novel framework that boosts distributed LLM inference in lossy edge networks. The core idea is to enable relaxed yet effective synchronization by strategically allocating less critical neuron groups to unstable devices, thus avoiding the excessive waiting time incurred by delayed packets. HALO introduces three key mechanisms: (1) a semantic-aware predictor that assesses the significance of neuron groups prior to activation; (2) a parallel execution scheme that overlaps neuron-group loading with model inference; and (3) a load-balancing scheduler that efficiently orchestrates multiple devices with heterogeneous resources. Experimental results on a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions, maintains performance comparable to that under ideal conditions, and significantly outperforms the state of the art across various scenarios.