Diffusion Transformer (DiT) architectures have significantly advanced Text-to-Image (T2I) generation but suffer from prohibitive computational costs and deployment barriers. To address these challenges, we propose an efficient compression framework that transforms the 60-layer, dual-stream, MMDiT-based Qwen-Image into lightweight models without training from scratch. Leveraging this framework, we introduce Amber-Image, a series of streamlined T2I models. We first derive Amber-Image-10B using a timestep-sensitive depth pruning strategy, in which retained layers are reinitialized via local weight averaging and then optimized through layer-wise distillation and full-parameter fine-tuning. Building on this, we develop Amber-Image-6B by introducing a hybrid-stream architecture that converts the deep-layer dual streams into a single stream initialized from the image branch, further refined via progressive distillation and lightweight fine-tuning. Our approach reduces parameters by 70% and eliminates the need for large-scale data engineering. Notably, the entire compression and training pipeline, from the 10B to the 6B variant, requires fewer than 2,000 GPU hours, demonstrating exceptional cost-efficiency compared to training from scratch. Extensive evaluations on benchmarks such as DPG-Bench and LongText-Bench show that Amber-Image achieves high-fidelity synthesis and superior text rendering, matching much larger models.
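The depth pruning with local-weight-averaging reinitialization described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the uniform contiguous-window mapping from original layers to retained slots, and the representation of each layer as a plain parameter dict are all illustrative assumptions; the actual method selects layers in a timestep-sensitive manner.

```python
import torch

def prune_with_local_averaging(layer_weights, keep_ratio=0.3):
    """Depth-prune a stack of transformer layers.

    Each retained slot is reinitialized by averaging the parameters of
    the contiguous window of original layers that maps onto it. This is
    a simplified, uniform-window sketch of local weight averaging; the
    paper's actual layer selection is timestep-sensitive.
    """
    n = len(layer_weights)
    k = max(1, round(n * keep_ratio))  # number of layers to keep
    merged = []
    for i in range(k):
        # Contiguous window of original layers assigned to slot i.
        start = round(i * n / k)
        end = round((i + 1) * n / k)
        window = layer_weights[start:end]
        # Average every parameter tensor elementwise across the window.
        avg = {
            name: torch.stack([w[name] for w in window]).mean(dim=0)
            for name in window[0]
        }
        merged.append(avg)
    return merged
```

Under this sketch, pruning a 10-layer stack at `keep_ratio=0.3` keeps 3 slots, each initialized as the mean of roughly 3 to 4 adjacent original layers; the averaged weights then serve as the starting point for layer-wise distillation rather than as a final model.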