The recent rapid development of Large Vision-Language Models (LVLMs) has shown their potential for embodied tasks. However, the critical skill of spatial understanding in embodied environments has not been thoroughly evaluated, leaving the gap between current LVLMs and qualified embodied intelligence unknown. We therefore construct EmbSpatial-Bench, a benchmark for evaluating the embodied spatial understanding of LVLMs. The benchmark is automatically derived from embodied scenes and covers 6 spatial relationships from an egocentric perspective. Experiments expose the insufficient capacity of current LVLMs (even GPT-4V). We further present EmbSpatial-SFT, an instruction-tuning dataset designed to improve the embodied spatial understanding of LVLMs.