Accurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to unknown object and human information, depth ambiguity, occlusion, and complex motion, all of which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming a ground-truth object template or by constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refines them through a learned render-and-compare paradigm to ensure spatial, temporal, and pixel alignment, and finally reasons about intricate contacts to further refine the result under physical constraints. Experiments show that our method outperforms prior art in reconstruction error by 38% on an in-distribution dataset and by 36% on an unseen dataset. Our model generalizes beyond the training categories and can thus be applied zero-shot to in-the-wild internet videos. Our code and pretrained models will be publicly released.