Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the instruction while maintaining spatial coherence. By explicitly guiding the reasoning process through structured relational representations, our approach improves both the interpretability of spatial relationships and control over them. We evaluate our method on a new text-guided layout editing benchmark encompassing sorting, spatial alignment, and room-editing tasks. Our training paradigm yields an average 15% improvement in IoU and a 25% reduction in center-distance error compared to Chain-of-Thought supervised fine-tuning (CoT-SFT) and vanilla GRPO baselines. Compared to state-of-the-art zero-shot LLMs, our best models achieve up to 20% higher mIoU, demonstrating markedly improved spatial precision.
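For readers unfamiliar with the two layout metrics named above, the following is a minimal sketch of how IoU, mean IoU, and center-distance error are commonly computed for axis-aligned 2D boxes. The `Box` type, its `(x_min, y_min, x_max, y_max)` layout, the per-object dictionary keyed by name, and the choice to score a missing object as zero are all illustrative assumptions; the abstract does not specify the benchmark's exact definitions.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """Axis-aligned 2D box. The corner-based layout is an assumption."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def area(self) -> float:
        # Clamp at zero so degenerate boxes never yield negative area.
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)

    @property
    def center(self) -> tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    inter_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0


def center_distance(a: Box, b: Box) -> float:
    """Euclidean distance between the two box centers."""
    (ax, ay), (bx, by) = a.center, b.center
    return hypot(ax - bx, ay - by)


def mean_iou(pred: dict[str, Box], target: dict[str, Box]) -> float:
    """Average IoU over target objects; an object missing from pred scores 0
    (an assumed convention, not one stated in the abstract)."""
    if not target:
        return 0.0
    total = sum(iou(pred[name], box) if name in pred else 0.0
                for name, box in target.items())
    return total / len(target)
```

Under these assumptions, a predicted layout would be compared against the reference layout object by object, with `mean_iou` summarizing overlap and `center_distance` averaged separately to capture positional error.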