The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is currently bottlenecked by the reliance on pre-scanned object templates and physical markers. Recent methods have shown promising results in reconstructing 4D hand-object interaction from videos, but they are highly sensitive to the initial estimates of hand and object poses. Estimating these poses from images is itself challenging, particularly under the severe occlusion inherent in hand-object interaction. We propose a novel system for robust and accurate reconstruction of hands and objects from synchronized, calibrated multi-view videos, without requiring any templates or markers. Our system consists of two main components: (1) a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable, metric-consistent initialization for both poses and dense object geometry, and (2) a physics-aware, Gaussian-based hand-object optimization framework that refines the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstructions. Validated on public benchmarks and an extensive internal dataset, our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page is available at https://zyshen021.github.io/HOSTPG/.