Understanding user instructions and the spatial relations among objects in the surrounding environment is crucial for intelligent robot systems to assist humans in various tasks. The natural language and spatial reasoning capabilities of Vision-Language Models (VLMs) have the potential to improve the generalization of robot planners to new tasks, objects, and motion specifications. While foundation models have been applied to task planning, it remains unclear to what degree they possess the spatial reasoning capability required to enforce user preferences or constraints on motion, such as desired distances from objects, topological properties, or motion style preferences. In this paper, we evaluate the spatial reasoning capability of four state-of-the-art VLMs over robot motion, using four different querying methods. Our results show that, with the highest-performing querying method, Qwen2.5-VL achieves 71.4% accuracy zero-shot, a smaller fine-tuned model reaches 75%, and GPT-4o performs worse. We evaluate two types of motion preferences (object proximity and path style), and we also analyze the trade-off between accuracy and computational cost measured in tokens. This work demonstrates the promise of integrating VLMs into robot motion planning pipelines.