Large Language Models (LLMs) and Vision-Language Models (VLMs) increasingly generate indoor scenes through intermediate structures such as layouts and scene graphs. Evaluation, however, still relies on LLM or VLM judges that score rendered views, which makes judgments sensitive to viewpoint, prompt phrasing, and hallucination. When the evaluator is unstable, it is difficult to tell whether a model has produced a spatially plausible scene or whether the score merely reflects the choice of viewpoint, rendering, or prompt. We introduce SceneCritic, a symbolic evaluator for floor-plan-level layouts. SceneCritic's constraints are grounded in SceneOnto, a structured spatial ontology we construct by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. SceneCritic traverses this ontology to jointly verify semantic, orientation, and geometric coherence across object relationships, providing object-level and relationship-level assessments that identify both specific violations and successful placements. We also pair SceneCritic with an iterative refinement test bed that probes how models build and revise spatial structure under three critic modalities: a rule-based critic that uses collision constraints as feedback, an LLM critic that operates on the layout as text, and a VLM critic that operates on rendered observations. Through extensive experiments, we show that (a) SceneCritic aligns substantially better with human judgments than VLM-based evaluators, (b) text-only LLMs can outperform VLMs on semantic layout quality, and (c) image-based VLM refinement is the most effective critic modality for semantic and orientation correction.
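To make the idea of a rule-based critic with collision constraints concrete, the following is a minimal sketch and not SceneCritic's actual implementation. It assumes, purely for illustration, that each object is an axis-aligned rectangle on the floor plan and that the room is a rectangle with one corner at the origin. The names `Footprint`, `overlap_area`, and `collision_feedback` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Footprint:
    """Hypothetical axis-aligned object footprint on the floor plan (meters)."""
    name: str
    x: float  # center along the room's x axis
    y: float  # center along the room's y axis
    w: float  # extent along x
    d: float  # extent along y

    def bounds(self):
        """Return (x_min, y_min, x_max, y_max) of the footprint."""
        return (self.x - self.w / 2, self.y - self.d / 2,
                self.x + self.w / 2, self.y + self.d / 2)


def overlap_area(a, b):
    """Area of intersection of two axis-aligned footprints (0.0 if disjoint)."""
    ax0, ay0, ax1, ay1 = a.bounds()
    bx0, by0, bx1, by1 = b.bounds()
    dx = min(ax1, bx1) - max(ax0, bx0)
    dy = min(ay1, by1) - max(ay0, by0)
    return dx * dy if dx > 0 and dy > 0 else 0.0


def collision_feedback(layout, room_w, room_d, tol=1e-6):
    """Return textual feedback: out-of-room objects first, then pairwise overlaps."""
    messages = []
    for obj in layout:
        x0, y0, x1, y1 = obj.bounds()
        if x0 < -tol or y0 < -tol or x1 > room_w + tol or y1 > room_d + tol:
            messages.append(f"{obj.name} extends outside the room")
    for a, b in combinations(layout, 2):
        if overlap_area(a, b) > tol:
            messages.append(f"{a.name} overlaps {b.name}")
    return messages


# Illustrative layout in a 4 m x 4 m room (values are made up).
layout = [
    Footprint("bed", 1.0, 1.0, 2.0, 2.0),
    Footprint("nightstand", 2.0, 1.0, 0.5, 0.5),
    Footprint("wardrobe", 3.8, 3.0, 1.0, 0.6),
]
feedback = collision_feedback(layout, room_w=4.0, room_d=4.0)
for line in feedback:
    print(line)
```

In a refinement loop of the kind the abstract describes, messages like these could be returned to the generating model as feedback for its next revision. A real critic would likely also need to handle rotated objects and non-rectangular rooms, which this sketch deliberately ignores.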