Despite recent advances in text-conditioned 3D indoor scene generation, there remain gaps in the evaluation of these methods. Existing metrics often measure realism by comparing generated scenes to a set of ground-truth scenes, but they overlook how well scenes follow the input text and capture implicit expectations of plausibility. We present SceneEval, an evaluation framework designed to address these limitations. SceneEval introduces fine-grained metrics for explicit user requirements (including object counts, attributes, and spatial relationships) and complementary metrics for implicit expectations such as support, collisions, and navigability. Together, these provide interpretable and comprehensive assessments of scene quality. To ground evaluation, we curate SceneEval-500, a benchmark of 500 text descriptions with detailed annotations of expected scene properties. This dataset establishes a common reference for reproducible and systematic comparison across scene generation methods. We evaluate six recent scene generation approaches using SceneEval and demonstrate its ability to provide detailed assessments of the generated scenes, highlighting strengths and areas for improvement across multiple dimensions. Our results identify significant gaps in current methods, underscoring the need for further research toward practical and controllable scene synthesis.