Low Earth orbit (LEO) satellites play an essential role in intelligent Earth observation by leveraging artificial intelligence models. However, limited onboard memory and excessive inference delay prevent the practical deployment of large language models (LLMs) on a single satellite. In this paper, we propose a communication-efficient collaborative LLM inference scheme for LEO satellite networks. Specifically, the entire LLM is split into multiple sub-models, each deployed on a different satellite, so that the satellites perform LLM inference collaboratively by exchanging intermediate activations. The proposed scheme also leverages pipeline parallelism to overlap sub-model inference with intermediate activation transmission, thereby reducing LLM inference delay. An adaptive activation compression scheme is designed to mitigate the errors that accumulate across multi-stage model splitting while preserving inference accuracy. Furthermore, we formulate an LLM inference delay minimization problem that jointly optimizes the model splitting and compression ratios under onboard memory and inference accuracy constraints. The problem is transformed into a shortest-path search problem over a directed acyclic graph whose edge weights explicitly quantify the inference delay induced by the model splitting and compression strategies, and it is solved via a modified A*-based search algorithm. Extensive simulation results indicate that the proposed solution can reduce inference delay by up to 42% and communication overhead by up to 71% compared to state-of-the-art benchmarks, while keeping the inference accuracy loss below 1%.
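To make the shortest-path formulation concrete, the following is a minimal sketch of how joint splitting and compression could be searched with A*. It is not the modified algorithm of the paper: the state encoding, the additive delay model (which ignores the pipeline overlap), the integer accuracy-loss units, and all names (`plan_split`, `ratios`, `loss_budget`, and so on) are assumptions made for illustration. A node is the triple (next layer, next satellite, accuracy loss used so far); an edge assigns a contiguous run of layers to the next satellite and picks a compression ratio for the activation sent onward; the edge weight is the compute time plus the transmission time of the compressed activation. The heuristic is the compute time of the layers not yet placed, which never overestimates the remaining delay.

```python
import heapq
from itertools import count


def plan_split(compute, memory, act_size, capacity, link_rate, ratios, loss_budget):
    """Search for a minimum-delay split of a layer chain over a chain of satellites.

    compute[i], memory[i], act_size[i]: per-layer compute time (s), memory
        footprint, and output activation size (bits). Hypothetical inputs.
    capacity[k]: memory of the k-th satellite; satellites are used in order.
    link_rate: inter-satellite link rate (bits/s), assumed constant.
    ratios: {kept_fraction: accuracy_loss}, loss in integer units (assumed).
    loss_budget: maximum total accuracy loss, in the same integer units.
    Returns (delay, plan) with plan = [(start, end, ratio), ...], or None.
    """
    n = len(compute)
    # suffix[i] = compute time of layers i..n-1: a lower bound on remaining delay.
    suffix = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + compute[i]

    tie = count()  # tie-breaker so the heap never compares plans
    start = (0, 0, 0)
    best = {start: 0.0}
    heap = [(suffix[0], next(tie), 0.0, start, ())]
    while heap:
        _, _, g, node, plan = heapq.heappop(heap)
        i, k, used = node
        if i == n:  # every layer placed: first goal popped is optimal
            return g, list(plan)
        if g > best.get(node, float("inf")) or k == len(capacity):
            continue  # stale entry, or no satellite left for remaining layers
        mem, t = 0.0, 0.0
        for j in range(i, n):  # try placing layers i..j on satellite k
            mem += memory[j]
            if mem > capacity[k]:
                break  # memory constraint violated; longer runs also fail
            t += compute[j]
            if j == n - 1:
                options = [(None, 0, 0.0)]  # final stage sends nothing onward
            else:
                options = [(r, loss, act_size[j] * r / link_rate)
                           for r, loss in ratios.items()]
            for r, loss, tx in options:
                if used + loss > loss_budget:
                    continue  # accuracy constraint violated
                nxt = (j + 1, k + 1, used + loss)
                g2 = g + t + tx
                if g2 < best.get(nxt, float("inf")):
                    best[nxt] = g2
                    heapq.heappush(
                        heap,
                        (g2 + suffix[j + 1], next(tie), g2, nxt, plan + ((i, j + 1, r),)),
                    )
    return None


# Toy instance: 4 identical layers, 2 satellites that each fit 2 layers.
toy = dict(compute=[1.0] * 4, memory=[2] * 4, act_size=[8.0] * 4,
           capacity=[4, 4], link_rate=8.0, ratios={1.0: 0, 0.5: 1, 0.25: 3})
print(plan_split(loss_budget=1, **toy))
```

On this toy instance the only feasible split is two layers per satellite, so the accuracy budget alone decides how aggressively the single transmitted activation is compressed. Carrying the accumulated loss in the node keeps the accuracy constraint inside an ordinary shortest-path search, at the cost of a larger graph.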