AI WiFi offload is emerging as a promising approach for providing large language model (LLM) services to resource-constrained wireless devices. However, unlike conventional edge computing, LLM inference over WiFi must jointly address heterogeneous model capabilities, wireless contention, uncertain task complexity, and semantic correlation among reasoning tasks. In this paper, we investigate LLM inference offloading in a multi-user multi-edge WiFi network, where each task can be executed locally, directly offloaded to a nearby edge access point (AP), or decomposed into multiple subtasks for collaborative execution across local and edge nodes. To this end, we propose a user-edge collaborative framework with an LLM-based planner that not only performs task decomposition but also infers subtask difficulty and expected output token length, enabling more accurate estimation of execution quality and latency on heterogeneous nodes. Based on these estimates, we further design a decomposition-aware scheduling strategy that jointly optimizes subtask assignment, execution, and aggregation under communication, queuing, and computation constraints. Simulation results show that the proposed framework achieves a better latency-accuracy tradeoff than local-only and nearest-edge baselines, reducing the average latency by $20\%$ and improving the overall reward by $80\%$. Moreover, the distilled lightweight planner approaches the performance of the large teacher model while remaining more suitable for practical edge deployment.
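To make the estimation-then-scheduling idea in the abstract concrete, the following is a minimal sketch, not the paper's actual algorithm. It assumes a simple additive latency model (transmission + queuing + decoding), a hypothetical quality model in which quality drops with subtask difficulty and rises with node capability, and a greedy per-subtask assignment. All names (`Node`, `Subtask`, `latency`, `quality`, `schedule`), the formulas, and the `latency_weight` parameter are illustrative assumptions that do not come from the source.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    tokens_per_s: float   # assumed decoding throughput of the node's model
    capability: float     # assumed model capability score in [0, 1]
    uplink_bps: float     # assumed WiFi uplink rate; float("inf") for local execution
    queue_s: float = 0.0  # backlog (seconds) already waiting at this node


@dataclass
class Subtask:
    name: str
    difficulty: float     # planner's difficulty estimate in [0, 1]
    out_tokens: int       # planner's expected output token length
    payload_bits: float   # prompt size to transmit if offloaded


def latency(task: Subtask, node: Node) -> float:
    """Assumed latency model: transmission + queuing + decoding time."""
    tx = task.payload_bits / node.uplink_bps  # evaluates to 0.0 when uplink is inf
    return tx + node.queue_s + task.out_tokens / node.tokens_per_s


def quality(task: Subtask, node: Node) -> float:
    """Hypothetical quality model: harder subtasks lose more on weaker nodes."""
    q = 1.0 - task.difficulty * (1.0 - node.capability)
    return max(0.0, min(1.0, q))


def schedule(tasks, nodes, latency_weight=0.1):
    """Greedily assign each subtask to the node with the best quality-latency reward."""
    plan = {}
    for task in tasks:
        best = max(
            nodes,
            key=lambda n: quality(task, n) - latency_weight * latency(task, n),
        )
        plan[task.name] = best.name
        # The chosen node's queue grows by this subtask's decoding time.
        best.queue_s += task.out_tokens / best.tokens_per_s
    return plan
```

In this sketch, a short and easy subtask tends to stay on the local device when the edge AP is congested, while a long and difficult one is sent to the more capable edge model despite its queue. A joint optimization over assignment, execution, and aggregation, as the abstract describes, would replace the greedy loop.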