CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration

Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, Shayan Baghayi Nejad, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

Source: arXiv. Accepted at TMLR (2026).

Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this without fine-tuning the model: they optimize or explore the initial noise under the guidance of reward functions that score text-image alignment. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and Exploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks covering diverse compositional challenges show that CARINOX raises average alignment scores by 16% on T2I-CompBench++ and by 11% on the HRS benchmark, consistently outperforming state-of-the-art optimization-based and exploration-based methods across all major categories, while preserving image quality and diversity. The project page is available at https://amirkasaei.com/carinox/.
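
The combination described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_generator`, `toy_reward`, `reward_grad`, `select_reward`, and `explore_and_optimize` are hypothetical names, the frozen diffusion sampler is replaced by a simple `tanh` map so the example runs with NumPy alone, and Pearson correlation is an assumed stand-in for whatever correlation measure the paper uses. The sketch picks the reward that best tracks human scores, then draws several initial noises (exploration), refines each by gradient ascent on the reward (optimization), and keeps the best result.

```python
import numpy as np


def toy_generator(noise):
    """Hypothetical stand-in for a frozen sampler: maps noise to an 'image'."""
    return np.tanh(noise)


def toy_reward(image, target):
    """Hypothetical alignment reward: higher when the image is closer to target."""
    return -float(np.mean((image - target) ** 2))


def reward_grad(noise, target):
    """Analytic gradient of toy_reward(toy_generator(noise), target) w.r.t. noise."""
    image = np.tanh(noise)
    return -2.0 * (image - target) * (1.0 - image ** 2) / noise.size


def select_reward(reward_scores, human_scores):
    """Pick the reward whose scores correlate best with human judgments.

    Assumption: Pearson correlation; in a category-aware setup this would be
    called once per prompt category.
    """
    def corr(scores):
        return np.corrcoef(scores, human_scores)[0, 1]
    return max(reward_scores, key=lambda name: corr(reward_scores[name]))


def explore_and_optimize(target, num_seeds=4, steps=50, lr=2.0, dim=16, seed=0):
    """Exploration over several initial noises, each refined by gradient ascent."""
    rng = np.random.default_rng(seed)
    best_noise, best_score = None, -np.inf
    for _ in range(num_seeds):                 # exploration: fresh initial noise
        noise = rng.standard_normal(dim)
        for _ in range(steps):                 # optimization: ascend the reward
            noise = noise + lr * reward_grad(noise, target)
        score = toy_reward(toy_generator(noise), target)
        if score > best_score:                 # keep the best candidate so far
            best_noise, best_score = noise, score
    return best_noise, best_score


if __name__ == "__main__":
    target = np.full(16, 0.5)
    _, explored_only = explore_and_optimize(target, steps=0)
    _, combined = explore_and_optimize(target, steps=50)
    print("exploration only:", explored_only)
    print("exploration + optimization:", combined)
```

Setting `steps=0` reduces the procedure to pure exploration over the same noise draws, which gives a direct baseline for what the optimization stage adds in this toy setting.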
