Accurate capture of human-object interaction from ubiquitous sensors such as RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to unknown object and human information, depth ambiguity, occlusion, and complex motion, all of which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming a ground-truth object template or by constraining the problem to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates individual predictions from foundation models, jointly refines them through a learned render-and-compare paradigm to ensure spatial, temporal, and pixel alignment, and finally reasons about intricate contacts for further refinement that satisfies physical constraints. Experiments show that our method outperforms prior art by 38% on an in-distribution dataset and by 36% on an unseen dataset in terms of reconstruction error. Our model generalizes beyond the training categories and can thus be applied zero-shot to in-the-wild internet videos. Our code and pretrained models will be publicly released.
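To give a rough intuition for the pose hypothesis selection step described above, the following is a minimal sketch, not the authors' implementation: it greedily picks, per frame, the candidate object pose that is most temporally consistent with the previous choice. All function and variable names (score_hypothesis, select_poses) are invented for illustration; the actual CARI4D pipeline additionally uses learned render-and-compare refinement and contact reasoning, which are omitted here.

```python
# Hypothetical illustration only: greedy per-frame selection among candidate poses.
import numpy as np

def score_hypothesis(pose, prev_pose):
    """Toy score: prefer poses close to the previous frame (temporal term).
    A full system would also include an image-alignment (render-and-compare) term."""
    if prev_pose is None:
        return 0.0
    return float(np.linalg.norm(pose - prev_pose))

def select_poses(hypotheses_per_frame):
    """Pick one candidate pose per frame (poses represented as flat 6-D vectors)."""
    chosen, prev = [], None
    for candidates in hypotheses_per_frame:
        scores = [score_hypothesis(c, prev) for c in candidates]
        prev = candidates[int(np.argmin(scores))]
        chosen.append(prev)
    return np.stack(chosen)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 5 frames, 3 candidate poses each (e.g., from different foundation-model predictions).
    hyps = [rng.normal(size=(3, 6)) for _ in range(5)]
    print(select_poses(hyps).shape)  # (5, 6)
```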