DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression

Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead and high memory usage. Most existing diffusion codecs employ U-Net architectures, whose hierarchical downsampling forces diffusion to operate in shallow latent spaces (typically with only 8x spatial downscaling), resulting in excessive computation. In contrast, conventional VAE-based codecs work in much deeper latent domains (16x to 64x downscaling), which raises a key question: can diffusion operate effectively in such compact latent spaces without compromising reconstruction quality? To address this, we introduce DiT-IC, an Aligned Diffusion Transformer for Image Compression, which replaces the U-Net with a Diffusion Transformer that performs diffusion entirely in a latent space at 32x downscaled resolution. DiT-IC adapts a pretrained multi-step text-to-image DiT into a single-step reconstruction model through three key alignment mechanisms: (1) a variance-guided reconstruction flow that adapts denoising strength to latent uncertainty for efficient reconstruction; (2) a self-distillation alignment that enforces consistency with the encoder-defined latent geometry to enable one-step diffusion; and (3) a latent-conditioned guidance that replaces text prompts with semantically aligned latent conditions, enabling text-free inference. With these designs, DiT-IC achieves state-of-the-art perceptual quality while offering up to 30x faster decoding and drastically lower memory usage than existing diffusion-based codecs. Remarkably, it can reconstruct 2048x2048 images on a 16 GB laptop GPU.
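
To make the latent-resolution argument concrete, the short sketch below counts the spatial positions in a latent grid for a 2048x2048 image at the 8x and 32x downscaling factors mentioned above. The helper name `latent_positions` is hypothetical and is not taken from the paper; the calculation covers grid size only and assumes nothing about channel counts, patch sizes, or the actual DiT-IC architecture.

```python
def latent_positions(height: int, width: int, factor: int) -> int:
    """Number of spatial positions in a latent grid downscaled by `factor`.

    Illustrative helper (hypothetical name); it only divides the image
    dimensions by the spatial downscaling factor.
    """
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the downscaling factor")
    return (height // factor) * (width // factor)


# 8x downscaling: the shallow latent typical of U-Net diffusion codecs.
shallow = latent_positions(2048, 2048, 8)
# 32x downscaling: the deep latent in which DiT-IC performs diffusion.
deep = latent_positions(2048, 2048, 32)

print(shallow, deep, shallow // deep)  # prints "65536 4096 16"
```

Under these assumptions the 32x latent has 16 times fewer spatial positions than the 8x latent. If each position were treated as one transformer token (an assumption, since the patch size is not stated here), the number of pairwise attention interactions would shrink by roughly 16 squared, that is, 256 times.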
