Benefiting from pre-trained text-to-image (T2I) diffusion models, real-world image super-resolution (Real-ISR) methods can synthesize rich and realistic details. However, due to the inherent stochasticity of T2I models, different noise inputs often lead to outputs of varying perceptual quality. Although this randomness is sometimes seen as a limitation, it also produces outputs spanning a wide range of perceptual quality, which can be exploited to improve Real-ISR performance. To this end, we introduce Direct Perceptual Preference Optimization for Real-ISR (DP$^2$O-SR), a framework that aligns generative models with perceptual preferences without requiring costly human annotations. We construct a hybrid reward signal by combining full-reference and no-reference image quality assessment (IQA) models trained on large-scale human preference datasets; this reward encourages both structural fidelity and natural appearance. To better exploit this perceptual diversity, we move beyond standard best-vs-worst pair selection and construct multiple preference pairs from a group of outputs generated by the same model. Our analysis reveals that the optimal selection ratio depends on model capacity: smaller models benefit from broader coverage, while larger models respond better to stronger contrast in supervision. Furthermore, we propose hierarchical preference optimization, which adaptively weights training pairs based on intra-group reward gaps and inter-group diversity, enabling more efficient and stable learning. Extensive experiments across both diffusion- and flow-based T2I backbones demonstrate that DP$^2$O-SR significantly improves perceptual quality and generalizes well to real-world benchmarks.
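The abstract compresses three mechanisms: a hybrid IQA reward, multi-pair selection within a sampled group, and gap-based pair weighting. The Python sketch below illustrates one plausible reading of this pipeline; the `fr_iqa`/`nr_iqa` scoring functions, the mixing weight `alpha`, the selection `ratio`, and the softmax weighting are all illustrative assumptions, not the paper's exact formulation.

```python
import itertools
import math

def hybrid_reward(sr_img, ref_img, fr_iqa, nr_iqa, alpha=0.5):
    """Hypothetical hybrid reward: a weighted mix of a full-reference IQA
    score (structural fidelity w.r.t. the reference) and a no-reference
    IQA score (natural appearance). alpha is an assumed mixing weight."""
    return alpha * fr_iqa(sr_img, ref_img) + (1.0 - alpha) * nr_iqa(sr_img)

def build_preference_pairs(samples, rewards, ratio=0.25):
    """Form multiple (winner, loser) pairs from one group of outputs of
    the same model, rather than a single best-vs-worst pair. `ratio`
    controls how many top/bottom samples are paired; per the abstract's
    analysis, smaller models may favor a larger ratio (broader coverage)
    and larger models a smaller one (stronger contrast)."""
    order = sorted(range(len(samples)), key=lambda i: rewards[i], reverse=True)
    k = max(1, int(len(samples) * ratio))
    winners, losers = order[:k], order[-k:]
    pairs = []
    for w, l in itertools.product(winners, losers):
        if w != l:
            gap = rewards[w] - rewards[l]  # intra-group reward gap
            pairs.append((samples[w], samples[l], gap))
    return pairs

def pair_weights(pairs, tau=1.0):
    """Sketch of the intra-group half of hierarchical weighting: pairs with
    larger reward gaps get larger softmax-normalized weights. An
    inter-group term (omitted here) would further rescale each group by
    the diversity of its rewards."""
    exps = [math.exp(gap / tau) for (_, _, gap) in pairs]
    z = sum(exps)
    return [e / z for e in exps]
```

In use, one would sample several restorations per low-resolution input, score each with `hybrid_reward`, build weighted pairs with the two helpers above, and feed them to a DPO-style objective; the weighting concentrates gradient signal on high-contrast pairs while multi-pair selection keeps coverage of the group.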