Multimodal pretraining is an effective strategy for the trinity of goals of representation learning in autonomous robots: 1) extracting both local and global task progressions; 2) enforcing temporal consistency of visual representations; 3) capturing trajectory-level language grounding. Most existing methods pursue these goals through separate objectives, which often leads to suboptimal solutions. In this paper, we propose a universal unified objective that simultaneously extracts meaningful task progression information from image sequences and seamlessly aligns it with language instructions. We discover that, via implicit preferences, where a visual trajectory inherently aligns better with its corresponding language instruction than with mismatched ones, the popular Bradley-Terry model can be transformed into a representation learning objective through proper reward reparameterizations. The resulting framework, DecisionNCE, mirrors an InfoNCE-style objective but is distinctively tailored for decision-making tasks. It provides an embodied representation learning framework that elegantly extracts both local and global task progression features, enforces temporal consistency through implicit time contrastive learning, and ensures trajectory-level instruction grounding via multimodal joint encoding. Evaluation on both simulated and real robots demonstrates that DecisionNCE effectively facilitates diverse downstream policy learning tasks, offering a versatile solution for unified representation and reward learning. Project Page: https://2toinf.github.io/DecisionNCE/
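To make the link between the Bradley-Terry model and an InfoNCE-style objective concrete, the following is a minimal sketch, not the DecisionNCE implementation. It assumes a hypothetical score matrix `scores[i][j]` holding some trajectory-level reward of trajectory `i` under instruction `j`; how that reward is reparameterized from image and language encoders is not specified here. The function names are illustrative only.

```python
import math


def log_softmax_at(row, index):
    """Return log(exp(row[index]) / sum_k exp(row[k])), computed stably."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
    return row[index] - log_norm


def preference_nce_loss(scores):
    """InfoNCE-style loss over implicit preferences.

    scores[i][j] is a hypothetical trajectory-level reward of trajectory i
    under instruction j (an assumption made for this sketch). The matched
    pair (i, i) is treated as preferred over every mismatched pair (i, j).
    """
    n = len(scores)
    return -sum(log_softmax_at(scores[i], i) for i in range(n)) / n


def bradley_terry_prob(r_pos, r_neg):
    """Bradley-Terry probability that the item with reward r_pos is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_pos - r_neg)))
```

With a single mismatched instruction per trajectory, each term of this loss is the negative log of the Bradley-Terry preference probability; adding more mismatched instructions turns the same expression into a softmax over candidates, which is the InfoNCE form.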