To enable on-device inference, next-generation mobile networks are expected to provide real-time model downloading services to mobile users. However, powerful AI models are typically large, resulting in excessive end-to-end (E2E) downloading-and-inference (DAI) latency. To address this issue, we propose a simultaneous model downloading and inference (SLIDE) framework, which allows users to perform inference with the layers already downloaded while the remaining layers of the model are still being received. Under this framework, we formulate a task throughput maximization problem that jointly optimizes model provisioning, spectrum bandwidth allocation, and computing resource allocation for multi-user downlink systems. Unlike traditional DAI frameworks, SLIDE introduces recursive dependencies across layers: the inference latency of each layer depends recursively on the downloading bandwidth and computing resources allocated to all of its preceding layers. To solve this challenging problem, we design an efficient algorithm that obtains the optimal solution in polynomial time. Simulation results demonstrate that the proposed SLIDE framework significantly improves task throughput under latency and communication resource constraints compared with conventional model downloading schemes.
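The recursive dependency described in the abstract can be illustrated with a minimal single-user sketch. It is not the paper's formulation: it assumes layers are downloaded one after another at a fixed rate, that a layer can be computed only once it has fully arrived and the previous layer has finished computing, and that compute speed is constant. The function names `dai_latency` and `slide_latency` and all parameter names are hypothetical.

```python
from typing import Sequence


def dai_latency(sizes_bits: Sequence[float], flops: Sequence[float],
                rate_bps: float, speed_flops: float) -> float:
    """Conventional download-then-infer latency (assumed model):
    the whole model is downloaded before any layer is computed."""
    return sum(sizes_bits) / rate_bps + sum(flops) / speed_flops


def slide_latency(sizes_bits: Sequence[float], flops: Sequence[float],
                  rate_bps: float, speed_flops: float) -> float:
    """Overlapped download and inference latency (assumed model).

    Layer l starts computing at max(download finish of layer l,
    compute finish of layer l-1), so its finish time depends
    recursively on every preceding layer.
    """
    download_done = 0.0  # time at which the current layer has fully arrived
    compute_done = 0.0   # time at which the previous layer finished computing
    for size, work in zip(sizes_bits, flops):
        download_done += size / rate_bps
        compute_done = max(compute_done, download_done) + work / speed_flops
    return compute_done


if __name__ == "__main__":
    # Illustrative numbers only: three layers, 1 s download and 0.5 s compute each.
    sizes = [8e6, 8e6, 8e6]   # bits per layer
    work = [1e9, 1e9, 1e9]    # FLOPs per layer
    print(dai_latency(sizes, work, rate_bps=8e6, speed_flops=2e9))    # 4.5
    print(slide_latency(sizes, work, rate_bps=8e6, speed_flops=2e9))  # 3.5
```

In this toy setting the overlapped schedule hides all but the last layer's compute time behind the download, which is the kind of E2E latency reduction the framework targets. The multi-user problem in the abstract additionally optimizes how bandwidth and computing resources are shared across users, which this sketch does not model.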