While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.
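The abstract describes EGIP only as "a strict entity-first pass criterion" and does not give its formula. The sketch below is one plausible reading, not the authors' definition. It assumes that a sample is first checked against its entity-level specification, that any entity failure rejects the sample before constraints are considered, and that EGIP is the fraction of samples that pass both stages. The names `SampleJudgement`, `judge_sample`, and `egip` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleJudgement:
    """Hypothetical per-image annotation: one boolean per checked item."""
    entities_ok: List[bool]     # was each required entity realized correctly?
    constraints_ok: List[bool]  # was each constraint-level requirement met?


def judge_sample(j: SampleJudgement) -> str:
    """Return 'entity_fail', 'constraint_fail', or 'pass' for one sample."""
    # Assumed entity gate: a single failed entity rejects the sample,
    # and constraints are not consulted at all.
    if not all(j.entities_ok):
        return "entity_fail"
    # Assumed strictness: every constraint must hold once the gate is cleared.
    if not all(j.constraints_ok):
        return "constraint_fail"
    return "pass"


def egip(judgements: List[SampleJudgement]) -> float:
    """Fraction of samples that clear the entity gate and all constraints."""
    if not judgements:
        raise ValueError("need at least one judged sample")
    passed = sum(judge_sample(j) == "pass" for j in judgements)
    return passed / len(judgements)


if __name__ == "__main__":
    demo = [
        SampleJudgement([True, True], [True, True]),    # pass
        SampleJudgement([True, False], [True, True]),   # rejected at the gate
        SampleJudgement([True, True], [True, False]),   # fails a constraint
        SampleJudgement([True], [True]),                # pass
    ]
    print([judge_sample(j) for j in demo])
    print(egip(demo))
```

Under this reading, the pass/fail outcome is a conjunction over all entities and constraints; the gate mainly determines which failure is reported first, so entity errors are never masked by constraint scores.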