Resolution generalization in image generation enables the production of higher-resolution images at a lower training-resolution cost. However, a key obstacle for diffusion transformers on this problem is the mismatch between the positional encodings seen at inference and those used during training. Existing strategies, such as positional encoding interpolation, extrapolation, or hybrids of the two, do not fully resolve this mismatch. In this paper, we propose a novel two-dimensional randomized positional encoding, termed RPE-2D, that prioritizes the order of image patches rather than their absolute distances, enabling seamless high- and low-resolution generation without training on multiple resolutions. Concretely, RPE-2D independently samples positions along the horizontal and vertical axes over an expanded range during training, ensuring that the encodings used at inference lie within the training distribution and thereby improving resolution generalization. We further introduce a simple random resize-and-crop augmentation to strengthen order modeling, together with a micro-conditioning signal that indicates the applied cropping pattern. On the ImageNet dataset, RPE-2D achieves state-of-the-art resolution generalization, outperforming competitive methods when trained at $256^2$ and evaluated at $384^2$ and $512^2$, and when trained at $512^2$ and evaluated at $768^2$ and $1024^2$. RPE-2D also exhibits strong capabilities in low-resolution image generation, multi-stage training acceleration, and multi-resolution inheritance.
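The sampling idea described above can be illustrated with a minimal sketch. The function name, the `max_pos` range, and the use of NumPy are assumptions for illustration; the paper's actual implementation details (e.g., how positions are mapped into the transformer's positional embedding) are not specified here. The sketch draws strictly increasing random positions per axis from an expanded range, so only the *order* of patches is preserved, not their absolute distances:

```python
import numpy as np

def sample_rpe_2d_positions(h, w, max_pos=1024, rng=None):
    """Hypothetical sketch of RPE-2D position sampling.

    Draws h row positions and w column positions independently from an
    expanded range [0, max_pos), sorted so that the relative order of
    patches matches the image grid while absolute distances are random.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample distinct indices per axis, then sort to preserve patch order.
    rows = np.sort(rng.choice(max_pos, size=h, replace=False))
    cols = np.sort(rng.choice(max_pos, size=w, replace=False))
    # Each patch (i, j) is assigned the 2D position (rows[i], cols[j]).
    return np.stack(np.meshgrid(rows, cols, indexing="ij"), axis=-1)

# A 16x16 patch grid (e.g., a 256x256 image with 16x16 patches) receives
# positions that always lie inside the expanded training range, so a larger
# grid at inference (e.g., 32x32) stays within the same distribution.
pos = sample_rpe_2d_positions(16, 16)
```

Because any grid size up to `max_pos` per axis draws from the same distribution, inference at a resolution unseen during training produces encodings the model has already encountered, which is the core of the claimed resolution generalization.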