This study presents a control framework that leverages vision-language models (VLMs) to handle multiple tasks and robots. Existing VLM-based control methods achieve high performance across various tasks and robots within their training environment. However, they incur high costs when control policies must be learned for tasks and robots outside that environment. For industrial and household robots, learning in each novel environment where a robot is deployed is challenging. To address this issue, we propose a control framework that does not require learning control policies. Our framework combines the vision-language model CLIP with randomized control. CLIP computes the similarity between images and texts by embedding them in a shared feature space. We use CLIP to compute the similarity between camera images and text describing the target state. The robot is driven by a randomized controller that simultaneously explores and ascends the similarity gradient. Moreover, we fine-tune CLIP to improve the performance of the proposed method. We confirm the effectiveness of our approach through a multitask simulation and a real-robot experiment using a two-wheeled robot and a robot arm.
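The idea of steering a robot by a similarity score, without a learned policy, can be illustrated with a minimal sketch. This is not the authors' controller: it swaps in a simple gradient-free random search that accepts a random move only when the score rises. The names `GOAL`, `toy_similarity`, and `randomized_ascent` are hypothetical, and `toy_similarity` is a stand-in for the CLIP image-text similarity, which in a real system would be computed from a camera image and the goal text.

```python
import math
import random

# Hypothetical target state standing in for the goal text (assumption).
GOAL = (1.0, 2.0)


def cosine_similarity(u, v):
    """Cosine similarity of two embedding vectors, as used for CLIP-style scores."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def toy_similarity(state):
    """Stand-in for CLIP(image, text): embeds state and goal as (x, y, 1).

    The appended 1 makes the score peak only when state equals GOAL.
    """
    return cosine_similarity((*state, 1.0), (*GOAL, 1.0))


def randomized_ascent(similarity, state, steps=2000, sigma=0.1, seed=0):
    """Try random moves and keep only those that raise the similarity score."""
    rng = random.Random(seed)
    best = similarity(state)
    for _ in range(steps):
        # Exploration: perturb every coordinate with Gaussian noise.
        candidate = tuple(x + rng.gauss(0.0, sigma) for x in state)
        score = similarity(candidate)
        if score > best:  # accept the exploratory move only if it helps
            state, best = candidate, score
    return state, best


if __name__ == "__main__":
    start = (0.0, 0.0)
    final_state, final_score = randomized_ascent(toy_similarity, start)
    print("start score:", toy_similarity(start))
    print("final state:", final_state, "final score:", final_score)
```

Because the loop only needs a scalar score per observation, the same structure would apply to any robot whose camera image can be scored against a text prompt. The accept/reject rule here is an assumption chosen for brevity.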