State recognition of the environment and objects, such as the open/closed state of doors and the on/off state of lights, is indispensable for robots that perform daily life support and security tasks. Until now, state recognition methods have relied on training neural networks from manual annotations, preparing special sensors for the recognition, or writing programs by hand to extract features from point clouds or raw images. In contrast, we propose a robotic state recognition method that uses a pre-trained vision-language model capable of Image-to-Text Retrieval (ITR). We prepare several kinds of language prompts in advance, calculate the similarity between each prompt and the current image by ITR, and recognize the state from these similarities. Weighting each prompt optimally by black-box optimization further improves recognition accuracy. Experiments show that this method enables a variety of state recognitions simply by preparing multiple prompts, without retraining neural networks or manual programming. In addition, since each recognizer needs only its prompts and their weights, there is no need to prepare multiple models, which facilitates resource management. Through language, the method can recognize the open/closed state of transparent doors, whether water is running from a faucet, and even the qualitative state of whether a kitchen is clean, all of which have been challenging so far.
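The weighted-prompt idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the similarity values are invented rather than produced by a vision-language model, the decision rule (weighted sum above a threshold) is one plausible choice, plain random search stands in for the unspecified black-box optimizer, and all names (`toy_dataset`, `optimize_weights`, and so on) are hypothetical.

```python
import random

# Hypothetical prompts; in practice the similarities would come from an ITR model.
PROMPTS = ["the door is open", "the door is closed", "a photo of a room"]

# Each entry: (similarity of the image to each prompt, True if the door is open).
# These numbers are made up for illustration.
toy_dataset = [
    ([0.30, 0.20, 0.25], True),
    ([0.28, 0.22, 0.24], True),
    ([0.21, 0.29, 0.25], False),
    ([0.19, 0.31, 0.26], False),
]


def weighted_score(similarities, weights):
    """Weighted sum of the per-prompt similarities."""
    return sum(w * s for w, s in zip(weights, similarities))


def recognize(similarities, weights, threshold=0.0):
    """Assumed decision rule: the state is 'on' when the score exceeds the threshold."""
    return weighted_score(similarities, weights) > threshold


def accuracy(weights, dataset):
    """Fraction of labeled examples that the weighted recognizer gets right."""
    correct = sum(recognize(sims, weights) == label for sims, label in dataset)
    return correct / len(dataset)


def optimize_weights(dataset, n_prompts, n_trials=2000, seed=0):
    """Random search over weights in [-1, 1], used here as a simple black-box optimizer."""
    rng = random.Random(seed)
    best_weights = [1.0] * n_prompts  # baseline: every prompt weighted equally
    best_acc = accuracy(best_weights, dataset)
    for _ in range(n_trials):
        candidate = [rng.uniform(-1.0, 1.0) for _ in range(n_prompts)]
        acc = accuracy(candidate, dataset)
        if acc > best_acc:
            best_weights, best_acc = candidate, acc
    return best_weights, best_acc


best_weights, best_acc = optimize_weights(toy_dataset, len(PROMPTS))

if __name__ == "__main__":
    print("uniform-weight accuracy:", accuracy([1.0] * len(PROMPTS), toy_dataset))
    print("optimized accuracy:", best_acc)
```

The search only ever calls `accuracy`, so the recognizer is treated as a black box. In this sketch each recognizer is fully described by its prompt list and weight vector, which matches the point that no additional model needs to be stored per recognizer.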