Humans inherently possess generalizable visual representations that let them efficiently explore and interact with their environments in manipulation tasks. We argue that such a representation arises automatically from simultaneously learning multiple simple perceptual skills that are critical in everyday scenarios (e.g., hand detection and state estimation), and that it is better suited to learning robot manipulation policies than current state-of-the-art visual representations based purely on self-supervised objectives. We formalize this idea as human-oriented multi-task fine-tuning on top of pre-trained visual encoders, where each task is a perceptual skill tied to human-environment interaction. We introduce the Task Fusion Decoder, a plug-and-play embedding translator that exploits the underlying relationships among these perceptual skills to guide representation learning toward structure that matters for all of them, ultimately improving the learning of downstream robotic manipulation tasks. Extensive experiments across a range of robotic tasks and embodiments, in both simulated and real-world environments, show that the Task Fusion Decoder consistently improves the representations of three state-of-the-art visual encoders (R3M, MVP, and EgoVLP) for downstream manipulation policy learning. Project page: https://sites.google.com/view/human-oriented-robot-learning
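The abstract does not spell out how the Task Fusion Decoder relates the perceptual skills to one another, so the following is only a minimal illustrative sketch under stated assumptions, not the authors' implementation. It assumes each skill is represented by a task token, that the tokens first attend to each other (so skills can share information) and then attend to the features of a pre-trained encoder. The function names (`attend`, `fuse_tasks`), the token values, and the toy feature vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention on plain Python lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def fuse_tasks(task_tokens, encoder_features):
    """Assumed fusion step: tokens exchange information, then read the image."""
    # Self-attention among task tokens (assumption: this is how skills interact).
    shared = attend(task_tokens, task_tokens, task_tokens)
    # Cross-attention from the shared tokens to the encoder's features.
    return attend(shared, encoder_features, encoder_features)


# Hypothetical toy inputs: two skills named in the abstract, three feature vectors.
task_tokens = {"hand_detection": [1.0, 0.0], "state_estimation": [0.0, 1.0]}
encoder_features = [[0.5, 0.1], [0.2, 0.9], [0.3, 0.3]]

names = list(task_tokens)
fused = dict(zip(names, fuse_tasks([task_tokens[n] for n in names], encoder_features)))
```

In this sketch each fused embedding is a weighted average of the encoder features, with weights that depend on all task tokens at once. In an actual fine-tuning setup, per-task heads and losses would sit on top of such embeddings and their gradients would flow back into the encoder; those parts are omitted here because the abstract gives no details about them.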