For Embodied AI, jointly reconstructing dynamic hands and their dense scene context is crucial for understanding physical interaction. However, most existing methods recover isolated hands in local coordinates, overlooking the surrounding 3D environment. To address this, we present Hand3R, the first online framework for joint 4D hand-scene reconstruction from monocular video. Hand3R synergizes a pre-trained hand expert with a 4D scene foundation model via a scene-aware visual prompting mechanism. By injecting high-fidelity hand priors into a persistent scene memory, our approach reconstructs accurate hand meshes and dense metric-scale scene geometry simultaneously, in a single forward pass. Experiments demonstrate that Hand3R removes the need for offline optimization while delivering competitive performance in both local hand reconstruction and global hand positioning.
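The core idea of scene-aware visual prompting, injecting hand-expert features into a persistent scene memory, can be illustrated with a minimal sketch. All dimensions, names (`project_hand_prior`, `W_proj`, `scene_memory`), and the simple linear projection are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
D_HAND, D_SCENE = 64, 128    # feature widths of hand expert / scene model
N_HAND_TOK, N_MEM = 8, 32    # hand prompt tokens / persistent memory slots

def project_hand_prior(hand_feats, W):
    """Map hand-expert features into the scene model's token space."""
    return hand_feats @ W    # (N_HAND_TOK, D_SCENE)

# Per-frame features from a frozen hand expert (placeholder values).
hand_feats = rng.standard_normal((N_HAND_TOK, D_HAND))
W_proj = rng.standard_normal((D_HAND, D_SCENE)) * 0.01

# Persistent scene memory carried across frames (placeholder values).
scene_memory = rng.standard_normal((N_MEM, D_SCENE))

# "Prompting": prepend projected hand tokens to the memory, so the
# scene model can attend to hand evidence in a single forward pass.
prompt_tokens = project_hand_prior(hand_feats, W_proj)
prompted_memory = np.concatenate([prompt_tokens, scene_memory], axis=0)

print(prompted_memory.shape)  # (40, 128)
```

In practice such prompting would feed `prompted_memory` to the scene model's attention layers each frame; this sketch only shows the token-injection step.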