We aim to discover manipulation concepts, recognized as key physical states, that are embedded in unannotated demonstrations. The discovered concepts can facilitate the training of manipulation policies and promote generalization. Current methods that rely on multimodal foundation models to derive key states usually lack accuracy and semantic consistency because multimodal robot data is limited. In contrast, we introduce an information-theoretic criterion that characterizes the regularities signifying a set of physical states. We also develop a framework that trains a concept discovery network with this criterion, thereby bypassing the dependence on human semantics and alleviating the need for costly human labeling. The proposed criterion is based on the observation that key states, which deserve to be conceptualized, often admit more physical constraints than non-key states. This phenomenon can be formalized as maximizing the mutual information between a putative key state and its preceding state, a criterion we call Maximal Mutual Information (MaxMI). With MaxMI, the trained key state localization network can accurately identify states of sufficient physical significance, and these states exhibit reasonable semantic compatibility with human perception. Furthermore, the key states produced by the proposed framework lead to concept-guided manipulation policies with higher success rates and better generalization than the baselines across various robotic tasks, verifying the effectiveness of the proposed criterion.
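To make the MaxMI idea concrete, the following toy sketch scores each time step of a set of simulated demonstrations by the mutual information between the state at that step and its preceding state, then picks the highest-scoring step. It is an illustration under strong assumptions, not the framework described in the abstract: the states are scalars, the mutual information is estimated with the closed form for jointly Gaussian variables, I(X;Y) = -0.5 * log(1 - rho^2), and no network is trained. The function names (`pearson`, `gaussian_mi`, `simulate_demos`, `most_constrained_step`), the dynamics, and all numeric settings are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def gaussian_mi(xs, ys):
    """Mutual information in nats, ASSUMING xs and ys are jointly Gaussian."""
    rho2 = min(pearson(xs, ys) ** 2, 1.0 - 1e-12)  # clamp to keep the log finite
    return -0.5 * math.log(1.0 - rho2)


def simulate_demos(n_demos=2000, horizon=10, key_step=6, seed=0):
    """Toy scalar trajectories (hypothetical dynamics).

    Every transition is s_t = 0.5 * s_{t-1} + noise. The noise is small only at
    `key_step`, mimicking a physically constrained state that is almost fully
    determined by the state before it.
    """
    rng = random.Random(seed)
    demos = []
    for _ in range(n_demos):
        traj = [rng.gauss(0.0, 1.0)]
        for t in range(1, horizon):
            noise_std = 0.05 if t == key_step else 1.0
            traj.append(0.5 * traj[-1] + rng.gauss(0.0, noise_std))
        demos.append(traj)
    return demos


def most_constrained_step(demos):
    """Return (best_step, scores); scores[i] is the MI between steps i and i + 1."""
    horizon = len(demos[0])
    scores = []
    for t in range(1, horizon):
        prev_states = [traj[t - 1] for traj in demos]
        curr_states = [traj[t] for traj in demos]
        scores.append(gaussian_mi(prev_states, curr_states))
    best_step = max(range(len(scores)), key=scores.__getitem__) + 1
    return best_step, scores


demos = simulate_demos()
best_step, mi_scores = most_constrained_step(demos)
print("step with maximal MI to its predecessor:", best_step)
```

In this toy setting the low-noise step is expected to receive the highest score, because a tightly constrained state shares far more information with its predecessor than a loosely constrained one. Real demonstrations involve high-dimensional, non-Gaussian states, so a learned estimator would be needed in place of the closed-form expression used here.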