AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation

Reconstructing dynamic hand-object interactions from monocular videos is critical for collecting dexterous manipulation data and for creating realistic digital twins for robotics and VR. However, current methods face two major barriers: (1) reliance on neural rendering often yields fragmented geometry that is not simulation-ready under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline in which a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of occlusions in the video. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy: we initialize the object pose at a single interaction onset frame using a foundation model and propagate it temporally by leveraging the strong visual similarity between our generated asset and the video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, ARCTIC, and in-the-wild videos show that AGILE outperforms baselines in global geometric accuracy and remains robust on challenging sequences where prior methods frequently fail. By prioritizing physical validity, our method produces simulation-ready assets, which we validate via real-to-sim retargeting for robotic applications. Project page: https://agile-hoi.github.io.
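The anchor-and-track idea can be illustrated with a minimal sketch. This is not the AGILE implementation: `init_pose` stands in for the foundation-model pose estimate at the interaction onset frame, `track_step` stands in for the similarity-based propagation between neighboring frames, and both names are hypothetical. Propagating both forward and backward from the anchor is also an assumption, since the abstract only states that the pose is propagated temporally.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

Pose = TypeVar("Pose")    # e.g. a 6-DoF object pose; left opaque in this sketch
Frame = TypeVar("Frame")  # e.g. an RGB image; left opaque in this sketch


def anchor_and_track(
    frames: Sequence[Frame],
    anchor_idx: int,
    init_pose: Callable[[Frame], Pose],          # hypothetical: pose estimate at the anchor
    track_step: Callable[[Pose, Frame], Pose],   # hypothetical: update a neighbor's pose
) -> List[Pose]:
    """Estimate one pose per frame, starting from a single anchor frame."""
    if not 0 <= anchor_idx < len(frames):
        raise IndexError("anchor_idx out of range")

    # Anchor: a single pose estimate at the chosen frame.
    poses: Dict[int, Pose] = {anchor_idx: init_pose(frames[anchor_idx])}

    # Track forward: each frame starts from the previous frame's pose.
    for t in range(anchor_idx + 1, len(frames)):
        poses[t] = track_step(poses[t - 1], frames[t])

    # Track backward (assumed): each frame starts from the next frame's pose.
    for t in range(anchor_idx - 1, -1, -1):
        poses[t] = track_step(poses[t + 1], frames[t])

    return [poses[t] for t in range(len(frames))]
```

In this sketch no frame other than the anchor needs an independent initialization, which is the property that lets such a scheme avoid a global SfM step.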
