Recent work has shown that diffusion models can generate high-quality images by operating directly on SSL patch features rather than pixel-space latents. However, the dense patch grids from encoders like DINOv2 contain significant redundancy, making diffusion needlessly expensive. We introduce FlatDINO, a variational autoencoder that compresses this representation into a one-dimensional sequence of just 32 continuous tokens, an 8x reduction in sequence length and a 48x compression in total dimensionality. On ImageNet 256x256, a DiT-XL trained on FlatDINO latents achieves a gFID of 1.80 with classifier-free guidance while requiring 8x fewer FLOPs per forward pass and up to 4.5x fewer FLOPs per training step than diffusion on uncompressed DINOv2 features. These results are preliminary; this work is in progress.
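For concreteness, the stated compression ratios are consistent with a 16x16 patch grid of 768-dimensional DINOv2 features being mapped to 32 tokens of dimension 128. The patch size, feature dimension, and per-token latent dimension below are illustrative assumptions chosen to match the 8x and 48x figures, not values taken from the paper:

```python
# Sketch of the compression arithmetic implied by the abstract.
# Assumed (not stated above): patch size 16, DINOv2 feature dim 768 (ViT-B scale),
# and a per-token latent dim of 128 for the 32 FlatDINO tokens.

image_size = 256
patch_size = 16            # assumption
feature_dim = 768          # assumption
num_patches = (image_size // patch_size) ** 2   # 16 * 16 = 256 patch tokens

num_tokens = 32
latent_dim = 128           # assumption, chosen so the totals match 48x

seq_reduction = num_patches / num_tokens        # 256 / 32 = 8x shorter sequence
total_compression = (num_patches * feature_dim) / (num_tokens * latent_dim)
# (256 * 768) / (32 * 128) = 196608 / 4096 = 48x fewer total dimensions

print(f"sequence reduction: {seq_reduction:g}x, total compression: {total_compression:g}x")
```

The 8x sequence reduction roughly tracks the FLOPs-per-forward-pass saving, since transformer cost scales with sequence length.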