Generalist robots must learn from diverse, large-scale human-object interactions (HOI) to operate robustly in the real world. Monocular internet videos offer a nearly limitless and readily available source of such data, capturing an unparalleled diversity of human activities, objects, and environments. However, accurately and scalably extracting 4D interaction data from these in-the-wild videos remains a significant and unsolved challenge. To this end, we introduce 4DHOISolver, a novel and efficient optimization framework that constrains the ill-posed 4D HOI reconstruction problem by leveraging sparse, human-in-the-loop contact point annotations, while maintaining high spatio-temporal coherence and physical plausibility. Building on this framework, we construct Open4DHOI, a new large-scale 4D HOI dataset covering a diverse catalog of 144 object types and 103 actions. We further demonstrate the quality of our reconstructions by training an RL-based agent to imitate the recovered motions. However, a comprehensive benchmark of existing 3D foundation models shows that automatically predicting precise human-object contact correspondences remains an unsolved problem, underscoring the current necessity of our human-in-the-loop strategy while posing an open challenge to the community. Data and code will be publicly available at https://wenboran2002.github.io/open4dhoi/
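To give a concrete feel for the core idea of anchoring an ill-posed reconstruction with sparse contact annotations, the toy sketch below fits a rigid object pose per frame so that annotated object-surface contact points align with their corresponding points on the human, then smooths the trajectory for temporal coherence. This is a simplified illustration under our own assumptions, not the actual 4DHOISolver formulation; all function names (`fit_rigid`, `solve_sequence`) and the smoothing scheme are hypothetical.

```python
import numpy as np

def fit_rigid(obj_pts, tgt_pts):
    # Kabsch algorithm: find rotation R and translation t minimizing
    # ||R @ p + t - q|| over annotated contact pairs (p, q).
    # obj_pts: (N,3) contact points in the object's local frame.
    # tgt_pts: (N,3) corresponding contact points on the human, in world frame.
    mu_o, mu_t = obj_pts.mean(0), tgt_pts.mean(0)
    H = (obj_pts - mu_o).T @ (tgt_pts - mu_t)   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_o
    return R, t

def solve_sequence(obj_pts, human_pts_seq, smooth=0.5):
    # Per-frame rigid fit to the sparse contact annotations, followed by
    # simple exponential smoothing of translations as a stand-in for a
    # proper spatio-temporal coherence term (an assumption of this sketch).
    poses = [fit_rigid(obj_pts, h) for h in human_pts_seq]
    ts = np.array([t for _, t in poses])
    for i in range(1, len(ts)):
        ts[i] = smooth * ts[i - 1] + (1.0 - smooth) * ts[i]
    return [(R, t) for (R, _), t in zip(poses, ts)]
```

Even three or four annotated contact correspondences per frame fully determine the object's 6-DoF pose here, which is why sparse human-in-the-loop clicks are enough to constrain an otherwise ill-posed monocular problem.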