Feed-forward multi-frame 3D reconstruction models often degrade on videos with object motion. A global reference frame becomes ambiguous under multiple independent motions, while local pointmaps rely heavily on estimated relative poses and can drift, causing cross-frame misalignment and duplicated structures. We propose TrajVG, a reconstruction framework that makes cross-frame 3D correspondence an explicit prediction by estimating 3D trajectories in camera coordinates. We couple sparse trajectories, per-frame local pointmaps, and relative camera poses through geometric consistency objectives: (i) a bidirectional trajectory-pointmap consistency loss with controlled gradient flow, and (ii) a pose consistency objective driven by static track anchors that suppresses gradients from dynamic regions. To scale training to in-the-wild videos, where 3D trajectory labels are scarce, we reformulate the same coupling constraints as self-supervised objectives that use only pseudo 2D tracks, enabling unified training under mixed supervision. Extensive experiments on 3D tracking, pose estimation, pointmap reconstruction, and video depth show that TrajVG surpasses current feed-forward baselines.
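To make the trajectory-pointmap coupling concrete, here is a minimal NumPy sketch of one direction of such a consistency objective: the per-frame pointmap is bilinearly sampled at each track's 2D location and compared against the predicted 3D track point. The function names (`bilinear_sample`, `traj_pointmap_consistency`) and the plain L1 penalty are illustrative assumptions, not the paper's implementation; in an autodiff framework the two directions of the bidirectional loss would carry stop-gradients to control gradient flow, which a NumPy sketch cannot express.

```python
import numpy as np

def bilinear_sample(pointmap, uv):
    """Sample a (H, W, 3) camera-coordinate pointmap at continuous
    pixel coordinates uv (N, 2), ordered (u=x, v=y)."""
    H, W, _ = pointmap.shape
    u, v = uv[:, 0], uv[:, 1]
    u0 = np.clip(np.floor(u).astype(int), 0, W - 2)
    v0 = np.clip(np.floor(v).astype(int), 0, H - 2)
    du = (u - u0)[:, None]  # fractional offsets, broadcast over xyz
    dv = (v - v0)[:, None]
    p00 = pointmap[v0, u0]
    p01 = pointmap[v0, u0 + 1]
    p10 = pointmap[v0 + 1, u0]
    p11 = pointmap[v0 + 1, u0 + 1]
    return (p00 * (1 - du) * (1 - dv) + p01 * du * (1 - dv)
            + p10 * (1 - du) * dv + p11 * du * dv)

def traj_pointmap_consistency(traj_xyz, pointmap, uv):
    """Mean L1 distance between predicted 3D track points (N, 3) and the
    pointmap sampled at their 2D projections. Hypothetical loss form; the
    actual objective is bidirectional with controlled gradient flow."""
    sampled = bilinear_sample(pointmap, uv)
    return np.abs(traj_xyz - sampled).mean()
```

For a planar pointmap with `pointmap[v, u] = (u, v, 1)`, a track point predicted at `(1, 2, 1)` with 2D location `(1, 2)` yields zero loss, while any deviation between the trajectory head and the pointmap head is penalized, which is the coupling the consistency objective enforces.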