Egocentric videos pose unique challenges for 3D reconstruction because of rapid camera motion and frequent dynamic interactions. State-of-the-art static reconstruction systems, such as MapAnything, often degrade in these settings, suffering from catastrophic trajectory drift and "ghost" geometry caused by moving hands. We bridge this gap with a robust pipeline that adapts static reconstruction backbones to long-form egocentric video. Our approach introduces a mask-aware reconstruction mechanism that explicitly suppresses dynamic foreground in the attention layers, preventing hand artifacts from contaminating the static map. Furthermore, we employ a chunked reconstruction strategy with pose-graph stitching to ensure global consistency and eliminate long-term drift. Experiments on HD-EPIC and indoor drone datasets demonstrate that our pipeline significantly reduces absolute trajectory error and yields visually cleaner static geometry than naive baselines, effectively extending the capability of foundation models to dynamic first-person scenes.
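To make the idea of suppressing dynamic foreground in attention concrete, the following is a minimal sketch, not the pipeline's actual implementation. It assumes the common approach of setting the attention logits of masked key tokens to negative infinity before the softmax, so that tokens on moving hands receive zero weight. The function name `masked_attention` and its arguments are hypothetical, and how the mask is applied inside the real backbone is not specified in the abstract.

```python
import math


def masked_attention(queries, keys, values, dynamic_mask):
    """Scaled dot-product attention that ignores keys flagged as dynamic.

    queries: list of d-dimensional vectors.
    keys, values: lists of equal length (one entry per token).
    dynamic_mask[j] is True when token j lies on a moving object (e.g. a hand).
    All names here are illustrative assumptions, not the paper's API.
    """
    if all(dynamic_mask):
        raise ValueError("at least one static key token is required")
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Masked keys get a logit of -inf, so softmax gives them zero weight.
        logits = [
            -math.inf if masked else scale * sum(a * b for a, b in zip(q, k))
            for k, masked in zip(keys, dynamic_mask)
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(logit - peak) for logit in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of value vectors using only static tokens.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
        )
    return outputs


# Toy example: the third token is a "hand" token with an outlier value.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
values = [[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]]
out = masked_attention(queries, keys, values, [False, False, True])
```

In this toy case the output is a convex combination of the two static value vectors only, which is the same result as dropping the masked token from the sequence altogether.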