Recent advances in diffusion transformers (DiTs) have set new standards in image generation, yet these models remain impractical for on-device deployment due to their high computational and memory costs. In this work, we present an efficient DiT framework tailored for mobile and edge devices that achieves transformer-level generation quality under strict resource constraints. Our design combines three key components. First, we propose a compact DiT architecture with an adaptive global-local sparse attention mechanism that balances global context modeling against local detail preservation. Second, we introduce an elastic training framework that jointly optimizes sub-DiTs of varying capacities within a unified supernetwork, allowing a single model to dynamically adjust its capacity for efficient inference across different hardware. Finally, we develop Knowledge-Guided Distribution Matching Distillation, a step-distillation pipeline that integrates the DMD objective with knowledge transfer from few-step teacher models, yielding high-fidelity, low-latency generation (e.g., in 4 steps) suitable for real-time on-device use. Together, these contributions enable scalable, efficient, and high-quality diffusion models for deployment on diverse hardware.
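To make the adaptive global-local sparse attention concrete, the sketch below blends window-restricted (local) attention with attention over pooled global tokens, mixed by a learned per-head gate. This is a minimal PyTorch illustration under assumed design choices: the class name GlobalLocalAttention, the mean-pooled global tokens, and the sigmoid gate are our own stand-ins, not the paper's exact mechanism.

```python
# Minimal sketch: local window attention + global attention over pooled
# tokens, blended by a learned per-head gate. Assumes N is divisible by
# both `window` and `n_global`. Illustrative only.
import torch
import torch.nn.functional as F
from torch import nn

class GlobalLocalAttention(nn.Module):
    def __init__(self, dim, heads=8, window=16, n_global=16):
        super().__init__()
        self.heads, self.window, self.n_global = heads, window, n_global
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learned per-head gate balancing the global vs. local branches.
        self.gate = nn.Parameter(torch.zeros(heads, 1, 1))

    def forward(self, x):                       # x: (B, N, C)
        B, N, C = x.shape
        H, D = self.heads, C // self.heads
        q, k, v = self.qkv(x).view(B, N, 3, H, D).permute(2, 0, 3, 1, 4)

        # Local branch: attention restricted to non-overlapping windows.
        W = self.window
        ql, kl, vl = (t.view(B, H, N // W, W, D) for t in (q, k, v))
        local = F.scaled_dot_product_attention(ql, kl, vl).reshape(B, H, N, D)

        # Global branch: every query attends to a coarse set of pooled tokens.
        kg = k.view(B, H, self.n_global, N // self.n_global, D).mean(dim=3)
        vg = v.view(B, H, self.n_global, N // self.n_global, D).mean(dim=3)
        glob = F.scaled_dot_product_attention(q, kg, vg)

        g = torch.sigmoid(self.gate)            # per-head mixing weight in (0, 1)
        out = g * glob + (1 - g) * local
        return self.proj(out.transpose(1, 2).reshape(B, N, C))
```

Since the local branch costs O(N·W) and the global branch O(N·n_global), the combined cost stays linear in sequence length, which is what makes this kind of sparsification attractive on mobile hardware.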
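The elastic training idea can be illustrated with weight-shared sub-networks: shallow sub-DiTs reuse a prefix of the block stack, narrow ones slice the leading hidden channels, and each optimizer step trains the largest, smallest, and one random configuration ("sandwich" sampling). The module names, the MLP-style blocks, and the sandwich rule below are assumptions for illustration; the paper's supernetwork and sampling strategy may differ.

```python
# Minimal sketch of elastic (supernet) training with shared weights.
import random
import torch
import torch.nn.functional as F
from torch import nn

class ElasticBlock(nn.Module):
    """A block whose hidden width can be sliced at run time (weight sharing)."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, width_ratio=1.0):
        h = int(self.fc1.out_features * width_ratio)
        # Slice leading channels: small sub-DiTs reuse the large model's weights.
        y = F.gelu(F.linear(x, self.fc1.weight[:h], self.fc1.bias[:h]))
        return x + F.linear(y, self.fc2.weight[:, :h], self.fc2.bias)

class ElasticStack(nn.Module):
    def __init__(self, dim=256, hidden=1024, max_depth=12):
        super().__init__()
        self.blocks = nn.ModuleList(ElasticBlock(dim, hidden) for _ in range(max_depth))

    def forward(self, x, depth, width_ratio):
        for block in self.blocks[:depth]:   # shallow sub-DiTs use a prefix of blocks
            x = block(x, width_ratio)
        return x

def train_step(model, batch, target, optimizer):
    """Sandwich sampling: largest, smallest, and one random sub-network."""
    configs = [(12, 1.0), (6, 0.5),
               (random.randint(6, 12), random.choice([0.5, 0.75, 1.0]))]
    optimizer.zero_grad()
    for depth, width in configs:
        loss = F.mse_loss(model(batch, depth, width), target)
        loss.backward()                     # gradients accumulate in shared weights
    optimizer.step()
```

At deployment time, a single checkpoint then serves every hardware tier by fixing (depth, width_ratio) to whatever the target device can afford.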
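Finally, a rough sketch of one distillation step combining a DMD-style distribution-matching gradient with regression toward a few-step teacher. The callables real_score, fake_score, and few_step_teacher, the VP-style noising, and the weight lambda_kd are assumptions about how Knowledge-Guided Distribution Matching Distillation might be wired, not the paper's exact objective.

```python
# Sketch of a KG-DMD training step under the assumptions stated above.
import torch
import torch.nn.functional as F

def kgdmd_loss(student, real_score, fake_score, few_step_teacher,
               noise, alpha_t, sigma_t, t, lambda_kd=0.25):
    x = student(noise)                      # student's few-step sample

    # Forward-diffuse the sample to noise level t (simple VP-style mixing).
    noisy_x = alpha_t * x + sigma_t * torch.randn_like(x)

    with torch.no_grad():
        # DMD: the distribution-matching gradient is the gap between the
        # "fake" score (trained on student samples) and the real teacher score.
        grad = fake_score(noisy_x, t) - real_score(noisy_x, t)
    # Surrogate whose gradient w.r.t. x equals `grad` (standard DMD trick).
    dmd_loss = 0.5 * F.mse_loss(x, (x - grad).detach())

    with torch.no_grad():
        target = few_step_teacher(noise)    # e.g. a 4-step teacher's output
    kd_loss = F.mse_loss(x, target)         # knowledge-transfer regression term

    return dmd_loss + lambda_kd * kd_loss
```

The regression term anchors the student to a teacher that already generates well in few steps, while the DMD term pulls the student's output distribution toward the full diffusion model's, which is the motivation the abstract gives for combining the two.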