Embodied AI requires agents to understand goals, plan actions, and execute tasks in simulated environments. We present a comprehensive evaluation of Large Language Models (LLMs) on the VirtualHome benchmark using the Embodied Agent Interface (EAI) framework. We compare two representative 7B-parameter models, OPENPANGU-7B and QWEN2.5-7B, across four fundamental tasks: Goal Interpretation, Action Sequencing, Subgoal Decomposition, and Transition Modeling. We propose Structured Self-Consistency (SSC), an enhanced decoding strategy that leverages multiple sampling with domain-specific voting mechanisms to improve output quality on structured generation tasks. Experimental results demonstrate that SSC significantly enhances performance, with OPENPANGU-7B excelling at hierarchical planning while QWEN2.5-7B shows advantages in action-level tasks. Our analysis reveals complementary strengths across model types, providing insights for future embodied AI system development.
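The core of SSC, as described above, is sampling several candidate generations and selecting an answer by voting over their parsed structures rather than their raw text. A minimal sketch of that idea is shown below; the function name, the tuple representation of action sequences, and the whole-structure majority vote are illustrative assumptions, not the paper's actual domain-specific voting scheme.

```python
from collections import Counter

def structured_self_consistency(candidates):
    """Majority-vote over structured generations (hypothetical sketch).

    `candidates` is a list of parsed model outputs, each canonicalized
    as a tuple of steps (e.g., an action sequence). The paper's SSC
    uses domain-specific voting; here we simply vote on the whole
    canonical structure. Counter.most_common breaks ties by first
    occurrence, which stands in for any tie-breaking heuristic.
    """
    counts = Counter(candidates)
    best, _ = counts.most_common(1)[0]
    return best

# Example: five sampled action sequences from an LLM (mocked data).
samples = [
    ("walk kitchen", "grab cup", "put cup table"),
    ("walk kitchen", "grab cup", "put cup table"),
    ("walk kitchen", "open fridge", "grab cup"),
    ("walk kitchen", "grab cup", "put cup table"),
    ("grab cup",),
]
print(structured_self_consistency(samples))
# → ('walk kitchen', 'grab cup', 'put cup table')
```

In practice the vote could instead be taken per subgoal or per transition, which is closer to the "domain-specific voting mechanisms" the abstract mentions; the whole-sequence vote here is only the simplest instantiation.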