Minigolf is an exemplary real-world game for examining embodied intelligence: putting the ball demands challenging spatial and kinodynamic understanding. In addition, reflective reasoning is required when the feasibility of a challenge is not guaranteed. We introduce RoboGolf, a framework based on vision-language models (VLMs) that combines dual-camera perception with closed-loop action refinement, augmented by a reflective equilibrium loop. Both loops are powered at their core by fine-tuned VLMs. We analyze the framework's capabilities in an offline inference setting, relying on an extensive set of recorded trajectories. Exemplary demonstrations of the analyzed problem domain are available at https://jity16.github.io/RoboGolf/