Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is limited by prohibitive sampling overhead and high memory usage. Most existing diffusion codecs employ U-Net architectures, whose hierarchical downsampling forces diffusion to operate in shallow latent spaces (typically with only 8x spatial downscaling), resulting in excessive computation. In contrast, conventional VAE-based codecs work in much deeper latent domains (16x to 64x downscaling), which raises a key question: can diffusion operate effectively in such compact latent spaces without compromising reconstruction quality? To address this, we introduce DiT-IC, an Aligned Diffusion Transformer for Image Compression, which replaces the U-Net with a Diffusion Transformer that performs diffusion entirely in a 32x downscaled latent space. DiT-IC adapts a pretrained multi-step text-to-image DiT into a single-step reconstruction model through three key alignment mechanisms: (1) a variance-guided reconstruction flow that adapts denoising strength to latent uncertainty for efficient reconstruction; (2) a self-distillation alignment that enforces consistency with the encoder-defined latent geometry to enable one-step diffusion; and (3) a latent-conditioned guidance that replaces text prompts with semantically aligned latent conditions, enabling text-free inference. With these designs, DiT-IC achieves state-of-the-art perceptual quality while offering up to 30x faster decoding and drastically lower memory usage than existing diffusion-based codecs. Remarkably, it can reconstruct 2048x2048 images on a 16 GB laptop GPU.
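To make the gap between 8x and 32x spatial downscaling concrete, the following minimal sketch counts the latent positions a diffusion model would have to process at each downscaling factor. It is illustrative arithmetic only, not part of DiT-IC: the helper names `latent_grid` and `latent_positions` are hypothetical, the rounding-up of non-divisible sizes is an assumption about padding, and the sketch ignores channel count, patchification, and everything else that determines actual compute.

```python
def latent_grid(height, width, downscale):
    """Return (rows, cols) of the latent grid for a given spatial downscaling factor.

    Assumption: image sizes that are not divisible by the factor are padded up,
    so the division is rounded up (ceiling division).
    """
    rows = -(-height // downscale)  # ceiling division without importing math
    cols = -(-width // downscale)
    return rows, cols


def latent_positions(height, width, downscale):
    """Return the number of spatial positions in the latent grid."""
    rows, cols = latent_grid(height, width, downscale)
    return rows * cols


if __name__ == "__main__":
    # Compare the downscaling factors mentioned in the abstract on a 2048x2048 image.
    for factor in (8, 16, 32, 64):
        rows, cols = latent_grid(2048, 2048, factor)
        print(f"{factor}x downscaling: {rows}x{cols} grid, {rows * cols} positions")
```

Under these assumptions, a 2048x2048 image maps to a 256x256 grid (65,536 positions) at 8x downscaling and to a 64x64 grid (4,096 positions) at 32x, a 16-fold reduction in the number of spatial positions.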