Humans track objects effectively in 2D video streams by implicitly drawing on prior 3D knowledge and semantic reasoning. In contrast, most generic object tracking (GOT) methods rely primarily on 2D features of the target and its surroundings while neglecting 3D geometric cues, making them susceptible to partial occlusion, distractors, and variations in geometry and appearance. To address this limitation, we introduce GOT-Edit, an online cross-modality model editing approach that integrates geometry-aware cues into a generic object tracker operating on a 2D video stream. Our approach leverages features from a pre-trained Visual Geometry Grounded Transformer to infer geometric cues from only a few 2D images. To address the challenge of seamlessly combining geometry and semantics, GOT-Edit performs online model editing: by imposing null-space constraints during model updates, it incorporates geometric information while preserving semantic discrimination, yielding consistently better performance across diverse scenarios. Extensive experiments on multiple GOT benchmarks demonstrate that GOT-Edit achieves superior robustness and accuracy, particularly under occlusion and clutter, establishing a new paradigm for combining 2D semantics with 3D geometric reasoning in generic object tracking. The project page is available at https://chenshihfang.github.io/GOT-EDIT.
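The null-space-constrained update mentioned above can be illustrated with a minimal sketch: an edit to a layer's weights is projected onto the null space of cached (semantic) features, so the new (geometric) information is written in without changing the layer's responses to those preserved features. All names and shapes below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, n_sem = 16, 8, 10
W = rng.normal(size=(d_out, d_in))       # current layer weights
F_sem = rng.normal(size=(n_sem, d_in))   # cached semantic features to preserve

# Null-space projector of the semantic features: P = I - F^+ F.
# Any update right-multiplied by P cannot change outputs on F_sem,
# since P @ F_sem.T == 0.
P = np.eye(d_in) - np.linalg.pinv(F_sem) @ F_sem

# A raw edit direction derived from geometric cues (placeholder here).
delta_raw = rng.normal(size=(d_out, d_in))

# Constrain the edit to the null space before applying it.
W_new = W + delta_raw @ P

# Responses to the preserved semantic features are unchanged.
print(np.allclose(W @ F_sem.T, W_new @ F_sem.T))
```

Because `n_sem < d_in`, the null space is non-trivial, so the projected edit is generally non-zero: the update genuinely changes the weights, just not along directions that the preserved features occupy.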