In this study, we consider the problem of predicting task success for open-vocabulary manipulation by a manipulator, given an instruction sentence and egocentric images taken before and after manipulation. Conventional approaches, including multimodal large language models (MLLMs), often fail to appropriately understand detailed object characteristics or subtle changes in object positions. We propose Contrastive $\lambda$-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences. Our method integrates three key types of features into a multi-level aligned representation: features that preserve local image information, features aligned with natural language, and features structured through natural language. This allows the model to focus on important changes by examining the differences in the representation between the two images. We evaluate Contrastive $\lambda$-Repformer on a dataset built on the large-scale standard RT-1 dataset and on a physical robot platform. The results show that our approach outperforms existing approaches, including MLLMs; our best model improved accuracy by 8.66 points over a representative MLLM-based model.
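To make the idea of a multi-level aligned representation concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the encoders are replaced by placeholder linear projections over pre-extracted image embeddings, and the module names (`f_local`, `f_align`, `f_struct`), dimensions, and fusion head are all assumptions for illustration. It shows how per-level differences between the before/after images can be combined with the instruction embedding to predict success.

```python
# Hypothetical sketch of the multi-level representation-difference idea.
# Encoders, names, and dimensions are illustrative assumptions, not the paper's model.
import torch
import torch.nn as nn


class SuccessPredictor(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Three assumed feature extractors, one per feature type:
        #   f_local  - preserves local image information
        #   f_align  - image features aligned with natural language
        #   f_struct - features structured through natural language (e.g., via captions)
        # Here they are placeholder projections over a 512-d pre-extracted embedding.
        self.f_local = nn.Linear(512, dim)
        self.f_align = nn.Linear(512, dim)
        self.f_struct = nn.Linear(512, dim)
        self.text_proj = nn.Linear(512, dim)
        # Classifier over the concatenated per-level differences and the instruction.
        self.head = nn.Sequential(nn.Linear(4 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, img_before, img_after, instruction):
        # Build the multi-level representation for each image ...
        levels_before = [f(img_before) for f in (self.f_local, self.f_align, self.f_struct)]
        levels_after = [f(img_after) for f in (self.f_local, self.f_align, self.f_struct)]
        # ... and focus on what changed by taking per-level differences.
        diffs = [a - b for a, b in zip(levels_after, levels_before)]
        fused = torch.cat(diffs + [self.text_proj(instruction)], dim=-1)
        return torch.sigmoid(self.head(fused))  # predicted probability of task success


# Usage with dummy pre-extracted embeddings (batch of 2).
model = SuccessPredictor()
p = model(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512))
print(p.shape)  # torch.Size([2, 1])
```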