Pre-trained general-purpose Vision-Language Models (VLMs) hold the potential to enhance intuitive human-machine interaction thanks to their rich world knowledge and 2D object detection capabilities. However, VLMs that detect 3D coordinates are rare. In this work, we investigate the interactive abilities of VLMs by having a model return 3D object positions given a monocular RGB image from a wrist-mounted camera, a natural language query, and the robot's state. We collected and curated a heterogeneous dataset of more than 100,000 images and finetuned a VLM using QLoRA with a custom regression head. By implementing conditional routing, our model retains its ability to answer general visual queries while gaining specialized 3D position estimation capabilities. Our results demonstrate robust predictive performance, with a median MAE of 13 mm on the test set, a five-fold improvement over a simpler baseline without finetuning. In about 25% of cases, predictions fall within a range considered acceptable for the robot to interact with objects.
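As a rough illustration of how such a setup can be assembled, the following PyTorch/PEFT sketch combines 4-bit QLoRA finetuning with a small regression head and a routing flag. The backbone name, the head architecture, the hidden and state dimensions, and the `regress` flag are illustrative assumptions, not the implementation behind the reported results.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: load the frozen base VLM in 4-bit and train low-rank adapters on top.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForVision2Seq.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",  # assumed backbone; the abstract does not name one
    quantization_config=bnb_config,
)
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
vlm = get_peft_model(base, lora)

class PositionVLM(nn.Module):
    """Wraps the VLM with a regression head that maps pooled visual-language
    features plus the robot state to an (x, y, z) position estimate."""

    def __init__(self, vlm, hidden_size=4096, state_dim=7):
        super().__init__()
        self.vlm = vlm
        self.head = nn.Sequential(  # custom regression head (assumed shape)
            nn.Linear(hidden_size + state_dim, 512),
            nn.GELU(),
            nn.Linear(512, 3),  # metric x, y, z
        )

    def forward(self, pixel_values, input_ids, robot_state=None, regress=False):
        if not regress:
            # Conditional routing: general visual queries take the unchanged
            # generative path, so the VLM's broad abilities are preserved.
            return self.vlm(pixel_values=pixel_values, input_ids=input_ids)
        out = self.vlm(
            pixel_values=pixel_values,
            input_ids=input_ids,
            output_hidden_states=True,
        )
        feat = out.hidden_states[-1][:, -1, :]  # last-token hidden state
        return self.head(torch.cat([feat, robot_state], dim=-1))
```

One design point the sketch tries to capture: because the generative path is left untouched and only the adapters and head are trained, routing position queries through a separate head is what lets the model add 3D estimation without degrading general visual question answering.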