Synthetic simulation data and real-world human data provide scalable alternatives that circumvent the prohibitive cost of robot data collection. However, these sources suffer from the sim-to-real visual gap and the human-to-robot embodiment gap, respectively, which limits policy generalization to real-world scenarios. In this work, we identify a natural yet underexplored complementarity between these sources: simulation offers the robot actions that human data lacks, while human data provides the real-world observations that simulation struggles to render. Motivated by this insight, we present SimHum, a co-training framework that simultaneously extracts a kinematic prior from simulated robot actions and a visual prior from real-world human observations. Leveraging these two complementary priors, we achieve data-efficient and generalizable robotic manipulation in real-world tasks. Empirically, SimHum outperforms the baseline by up to $\mathbf{40\%}$ under the same data collection budget, and achieves a $\mathbf{62.5\%}$ OOD success rate with only 80 real-world demonstrations, outperforming the real-only baseline by $7.1\times$. Videos and additional information can be found at the \href{https://kaipengfang.github.io/sim-and-human}{project website}.