While representation alignment with self-supervised models has been shown to improve diffusion model training, its potential for enhancing inference-time conditioning remains largely unexplored. We introduce Representation-Aligned Guidance (REPA-G), a framework that leverages these aligned representations, with their rich semantic properties, to enable test-time conditioning on features during generation. By optimizing a similarity objective (the potential) at inference, we steer the denoising process toward a conditioning representation extracted from a pre-trained feature extractor. Our method provides versatile control at multiple scales, ranging from fine-grained texture matching via single patches to broad semantic guidance using global image feature tokens. We further extend this to multi-concept composition, allowing for the faithful combination of distinct concepts. REPA-G operates entirely at inference time, offering a flexible and precise alternative to often ambiguous text prompts or coarse class labels. We theoretically justify how this guidance enables sampling from the potential-induced tilted distribution. Quantitative results on ImageNet and COCO demonstrate that our approach achieves high-quality, diverse generations. Code is available at https://github.com/valeoai/REPA-G.
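The core mechanism described above — nudging each denoising step with the gradient of a similarity potential between features of the current estimate and a target representation — can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: `denoiser`, `extractor`, and `guided_denoise_step` are stand-in names, and the stub models below only serve to make the gradient flow concrete.

```python
import torch
import torch.nn.functional as F


def cosine_potential(feat, target):
    # Similarity objective (the "potential"): cosine similarity
    # between extracted features and the conditioning representation.
    return F.cosine_similarity(feat, target, dim=-1).sum()


def guided_denoise_step(x_t, t, denoiser, extractor, target_feat, scale=1.0):
    """One guided denoising step (illustrative sketch).

    Adds the gradient of the potential, evaluated on features of the
    denoiser's current estimate, to steer sampling toward the
    potential-induced tilted distribution.
    """
    x = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x, t)                 # denoiser's estimate of the clean sample
    feat = extractor(x0_hat)                # frozen pre-trained feature extractor
    pot = cosine_potential(feat, target_feat)
    grad = torch.autograd.grad(pot, x)[0]   # gradient of the potential w.r.t. x_t
    return (x0_hat + scale * grad).detach()


# Tiny stubs standing in for a real diffusion model and feature extractor.
torch.manual_seed(0)
denoiser = lambda x, t: 0.9 * x             # stub denoiser (differentiable)
extractor = torch.nn.Linear(8, 4)           # stub for a frozen feature backbone
target_feat = torch.randn(2, 4)             # conditioning representation
x_t = torch.randn(2, 8)                     # noisy latents
x_next = guided_denoise_step(x_t, t=0, denoiser=denoiser,
                             extractor=extractor,
                             target_feat=target_feat, scale=0.1)
```

Running the guided step repeatedly over the sampling schedule would progressively pull generations toward the target features; the `scale` knob trades off guidance strength against sample diversity.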