Detecting Non-Optimal Decisions of Embodied Agents via Diversity-Guided Metamorphic Testing

As embodied agents advance toward real-world deployment, ensuring that they make optimal decisions becomes critical, particularly for resource-constrained applications. Current evaluation methods focus primarily on functional correctness and overlook the non-functional optimality of generated plans. This gap can lead to significant performance degradation and resource waste. We identify and formalize the problem of Non-optimal Decisions (NoDs), in which agents complete tasks successfully but inefficiently. We present NoD-DGMT, a systematic framework for detecting NoDs in embodied agent task planning via diversity-guided metamorphic testing. Our key insight is that optimal planners should exhibit invariant behavioral properties under specific transformations. We design four novel metamorphic relations that capture fundamental optimality properties: position detour suboptimality, action optimality completeness, condition refinement monotonicity, and scene perturbation invariance. To maximize detection efficiency, we introduce a diversity-guided selection strategy that actively selects test cases exploring different violation categories, avoiding redundant evaluations while maintaining broad diversity coverage. Extensive experiments on the AI2-THOR simulator with four state-of-the-art planning models demonstrate that NoD-DGMT achieves an average violation detection rate of 31.9%, with the diversity-guided filter improving detection rates by 4.3% and diversity scores by 3.3 on average. NoD-DGMT significantly outperforms six baseline methods, with a 16.8% relative improvement over the best baseline, and demonstrates consistent superiority across different model architectures and task complexities.
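
To make the idea of a metamorphic relation and of diversity-guided selection concrete, the following is a minimal sketch under stated assumptions, not the NoD-DGMT implementation. It assumes one plausible reading of the position detour relation: a plan forced through an extra waypoint should never be cheaper than the unconstrained plan, so a cheaper detour exposes a non-optimal source plan. The grid world, the Manhattan cost, the two toy planners, and the greedy least-covered-category selection are all hypothetical and introduced only for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Point = Tuple[int, int]
# Hypothetical planner interface: returns a list of waypoints from start to goal.
Planner = Callable[[Point, Point], List[Point]]


def path_cost(path: Sequence[Point]) -> int:
    """Total Manhattan length of a waypoint path (assumed cost model)."""
    return sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in zip(path, path[1:]))


def violates_detour_relation(planner: Planner, start: Point, goal: Point, via: Point) -> bool:
    """Flag a violation if the plan forced through `via` is strictly cheaper
    than the unconstrained source plan; an optimal planner can never do this."""
    source = planner(start, goal)
    # Drop the duplicated `via` waypoint when joining the two legs.
    follow_up = planner(start, via)[:-1] + planner(via, goal)
    return path_cost(follow_up) < path_cost(source)


def direct_planner(start: Point, goal: Point) -> List[Point]:
    """Toy optimal planner: goes straight to the goal."""
    return [start, goal]


def wasteful_planner(start: Point, goal: Point) -> List[Point]:
    """Toy non-optimal planner: adds a needless side trip on long tasks."""
    dist = abs(start[0] - goal[0]) + abs(start[1] - goal[1])
    if dist > 4:
        return [start, (start[0], start[1] + 3), goal]
    return [start, goal]


def select_diverse(candidates: List[Tuple[str, str]], budget: int) -> List[str]:
    """Greedy selection (an assumption, not the paper's strategy): repeatedly
    pick the candidate whose predicted violation category is least covered."""
    counts: Counter = Counter()
    remaining = list(candidates)
    chosen: List[str] = []
    while remaining and len(chosen) < budget:
        best = min(remaining, key=lambda c: counts[c[1]])  # ties keep input order
        chosen.append(best[0])
        counts[best[1]] += 1
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    print(violates_detour_relation(direct_planner, (0, 0), (6, 0), (3, 0)))
    print(violates_detour_relation(wasteful_planner, (0, 0), (6, 0), (3, 0)))
    pool = [("t1", "detour"), ("t2", "detour"), ("t3", "action"), ("t4", "scene")]
    print(select_diverse(pool, budget=3))
```

In this toy setting the wasteful planner is flagged because its unconstrained plan costs more than the plan routed through the intermediate waypoint, while the direct planner satisfies the relation. The selection routine prefers test cases from categories that have not yet been sampled, which is one simple way to avoid redundant evaluations.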
