Human perception for effective object tracking in a 2D video stream arises from the implicit use of prior 3D knowledge combined with semantic reasoning. In contrast, most generic object tracking (GOT) methods rely primarily on 2D features of the target and its surroundings while neglecting 3D geometric cues, which makes them susceptible to partial occlusion, distractors, and variations in geometry and appearance. To address this limitation, we introduce GOT-Edit, an online cross-modality model editing approach that integrates geometry-aware cues, inferred from a 2D video stream, into a generic object tracker. Our approach leverages features from a pre-trained Visual Geometry Grounded Transformer to infer geometric cues from only a few 2D images. To tackle the challenge of seamlessly combining geometry and semantics, GOT-Edit performs online model editing with null-space constrained updates that incorporate geometric information while preserving semantic discrimination, yielding consistently better performance across diverse scenarios. Extensive experiments on multiple GOT benchmarks demonstrate that GOT-Edit achieves superior robustness and accuracy, particularly under occlusion and clutter, establishing a new paradigm for combining 2D semantics with 3D geometric reasoning in generic object tracking.
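The null-space constrained update mentioned above can be illustrated in general form: a weight update is projected onto the null space of a set of activations whose outputs must be preserved, so new (geometric) information is absorbed without disturbing existing (semantic) behavior. The sketch below is a minimal, generic illustration of this idea; the function names, shapes, and the SVD-based projector are our assumptions, not the paper's implementation.

```python
import numpy as np

def null_space_projector(K: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Projector onto the null space of K.

    Rows of K are "preserved" input activations: any update direction
    multiplied by this projector maps those inputs to zero change.
    (Illustrative assumption, not the paper's exact construction.)
    """
    # Right singular vectors with (near-)zero singular values span null(K).
    _, s, vt = np.linalg.svd(K, full_matrices=True)
    rank = int(np.sum(s > eps * s.max())) if s.size else 0
    null_basis = vt[rank:]            # (d - rank, d) orthonormal rows
    return null_basis.T @ null_basis  # (d, d) projection matrix

def constrained_update(W: np.ndarray, grad: np.ndarray,
                       K_preserve: np.ndarray, lr: float = 1e-2) -> np.ndarray:
    """One gradient step on W (shape m x d) that leaves W @ k unchanged
    for every preserved activation k (rows of K_preserve)."""
    P = null_space_projector(K_preserve)
    return W - lr * (grad @ P)  # projected step: (grad @ P) @ k == 0 for preserved k
```

Because the step direction is projected into the null space of the preserved activations, outputs on those activations are exactly unchanged while the remaining degrees of freedom still absorb the update.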