Simulated environments play an essential role in embodied AI, functionally analogous to test cases in software engineering. However, existing environment generation methods often emphasize visual realism (e.g., object diversity and layout coherence) while overlooking a crucial aspect: logical diversity from the testing perspective. This limits the comprehensive evaluation of agent adaptability and planning robustness across distinct simulated environments. To bridge this gap, we propose LogicEnvGen, a novel method driven by Large Language Models (LLMs) that adopts a top-down paradigm to generate logically diverse simulated environments as test cases for agents. Given an agent task, LogicEnvGen first analyzes its execution logic to construct decision-tree-structured behavior plans and then synthesizes a set of logical trajectories. It subsequently applies a heuristic algorithm to refine the trajectory set, reducing redundant simulations. For each logical trajectory, which represents a potential task situation, LogicEnvGen instantiates a corresponding concrete environment, employing constraint solving to ensure physical plausibility. Furthermore, we introduce LogicEnvEval, a novel benchmark comprising four quantitative metrics for environment evaluation. Experimental results verify the lack of logical diversity in baseline methods and demonstrate that LogicEnvGen achieves 1.04-2.61x greater diversity, improving fault-revealing performance by 4.00%-68.00%.