Visual reasoning -- the ability to interpret the visual world -- is crucial for embodied agents that operate within three-dimensional scenes. Progress in AI has led to vision and language models capable of answering questions from images. However, their performance declines on tasks that require 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach in which LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. Our method overcomes limitations of prior approaches that rely on a static, human-defined API, allowing it to handle a wider range of queries. To assess AI capabilities for 3D understanding, we introduce a new benchmark of queries involving multiple steps of grounding and inference. We show that our method outperforms prior zero-shot models for visual reasoning in 3D and empirically validate the effectiveness of our agentic framework for 3D spatial reasoning tasks. Project website: https://glab-caltech.github.io/vadar/