LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation

Despite recent progress in using Large Language Models (LLMs) to generate 3D scenes automatically, the generated scenes often lack the realistic spatial layouts and object attributes of real-world environments. Because this problem stems from insufficiently detailed, coarse-grained instructions, it is crucial to advance 3D scene synthesis guided by fine-grained instructions that reflect real-world environments. Embodied agents trained in unrealistic scenes can learn priors that diverge significantly from real-world physics and semantics, which degrades their performance when deployed. Verifying the alignment between a fine-grained instruction and the generated scene is therefore essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to assess this alignment reliably. The shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools that explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments show that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods: across all evaluated approaches, success rates reach at most 10% in generating scenes that fully align with fine-grained instructions.
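The two quantities reported above, an F1 score for alignment judgments and a success rate for fully aligned scenes, can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: that each judgment is a binary aligned/not-aligned label compared against a human label, and that a scene counts as a success only when every constraint in its instruction is satisfied. The function names `f1_score` and `full_alignment_rate` and all inputs are hypothetical, not part of LEGO-Eval or LEGO-Bench.

```python
from typing import List


def f1_score(predicted: List[bool], gold: List[bool]) -> float:
    """Binary F1 of an evaluator's judgments against human labels (assumed setup)."""
    pairs = list(zip(predicted, gold))
    tp = sum(p and g for p, g in pairs)          # judged aligned, truly aligned
    fp = sum(p and not g for p, g in pairs)      # judged aligned, truly misaligned
    fn = sum((not p) and g for p, g in pairs)    # judged misaligned, truly aligned
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def full_alignment_rate(scenes: List[List[bool]]) -> float:
    """Fraction of scenes whose per-constraint results are all True (assumed criterion)."""
    if not scenes:
        return 0.0
    return sum(all(constraints) for constraints in scenes) / len(scenes)
```

Under this all-or-nothing criterion, a single violated constraint makes the whole scene a failure, which is one way a method could satisfy many individual constraints and still score a low success rate.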
