Text-to-image diffusion models are increasingly deployed at the network edge to serve heterogeneous workloads with diverse quality and latency requirements. However, existing deployment strategies force a choice between large edge-side models, which offer high fidelity at high latency, and lightweight device-side models, which gain speed at the cost of semantic coherence. Moreover, these approaches rarely split the denoising workload between models of different sizes across edge servers and user devices. To bridge this gap, we propose RISE, a method for edge-device diffusion model services that combines relay inference with online scheduling. Motivated by the finding that the latent intensity exhibits minimal deviation after a model handoff, RISE uses a training-free relay mechanism that exploits the shared latent space within a model family: the large model on the edge handles the early denoising steps that shape semantic structure, then passes the intermediate latent to a small device-side model for detail refinement. To deploy this mechanism as a practical service, a contextual bandit scheduler selects the best relay configuration based on prompt complexity, user preferences, network quality, and real-time node loads. Experiments on two benchmarks show that RISE's relay mechanism achieves up to a 2.1$\times$ speedup while preserving full-model quality, and that its context-aware scheduler effectively balances quality and latency under mixed workloads.
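The relay handoff described above can be illustrated with a minimal sketch. This is not the RISE implementation: the names `relay_denoise`, `large_step`, `small_step`, and `handoff_step` are hypothetical, the latent is a plain list of floats, and the two "denoisers" are toy stand-ins for a large edge-side model and a small device-side model that are assumed to share a latent space.

```python
from typing import Callable, List, Tuple

Latent = List[float]
# A denoising step maps (latent, step index) -> updated latent.
DenoiseStep = Callable[[Latent, int], Latent]


def relay_denoise(
    latent: Latent,
    large_step: DenoiseStep,
    small_step: DenoiseStep,
    total_steps: int,
    handoff_step: int,
) -> Tuple[Latent, List[str]]:
    """Run the first `handoff_step` steps with the large (edge) model,
    then pass the intermediate latent to the small (device) model.

    Returns the final latent and a trace of which side ran each step.
    """
    if not 0 <= handoff_step <= total_steps:
        raise ValueError("handoff_step must lie in [0, total_steps]")

    trace: List[str] = []
    for t in range(total_steps):
        if t < handoff_step:
            latent = large_step(latent, t)   # early steps: semantic structure
            trace.append("edge")
        else:
            latent = small_step(latent, t)   # later steps: detail refinement
            trace.append("device")
    return latent, trace


# Toy stand-ins (assumptions for illustration only): each "model"
# simply scales the latent, with the large model taking bigger steps.
def toy_large_step(latent: Latent, t: int) -> Latent:
    return [0.5 * x for x in latent]


def toy_small_step(latent: Latent, t: int) -> Latent:
    return [0.9 * x for x in latent]


if __name__ == "__main__":
    final, trace = relay_denoise([1.0], toy_large_step, toy_small_step,
                                 total_steps=4, handoff_step=2)
    print(trace, final)
```

In this sketch, `handoff_step` is the single knob that trades quality against latency: `handoff_step == total_steps` corresponds to running everything on the edge model, and `handoff_step == 0` to running everything on the device. A scheduler such as the contextual bandit mentioned in the abstract would pick this value per request; how RISE parameterizes its relay configurations is not specified here.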