GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

From arXiv. Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). This research focuses on learning model adaptation for adverse and dynamic environments, as well as fine-grained occlusion perception for tracking.

The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.
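The teacher-student objective described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`corrupt`, `predict_tracking_model`, `pseudo_supervision_loss`), the two-weight linear predictor, the use of random feature dropping as a stand-in for occlusion, and the mean-squared-error loss. Only the overall pattern follows the abstract: both predictors receive the same history, the teacher sees the clean current frame, and the student sees a corrupted one.

```python
import random


def corrupt(frame, drop_prob, rng):
    """Zero each feature with probability drop_prob (hypothetical occlusion stand-in)."""
    return [0.0 if rng.random() < drop_prob else value for value in frame]


def predict_tracking_model(weights, history, frame):
    """Toy linear predictor (an assumption): combine history and current-frame features."""
    w_history, w_frame = weights
    return [w_history * h + w_frame * f for h, f in zip(history, frame)]


def pseudo_supervision_loss(history, clean_frame, teacher_w, student_w, drop_prob, rng):
    """Mean squared error between the student's output and the teacher's pseudo target."""
    # Teacher: same history, clean current frame. Its output is treated as a fixed target.
    target = predict_tracking_model(teacher_w, history, clean_frame)
    # Student: same history, corrupted current frame.
    corrupted_frame = corrupt(clean_frame, drop_prob, rng)
    prediction = predict_tracking_model(student_w, history, corrupted_frame)
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    history = [1.0, 2.0]
    clean_frame = [3.0, 4.0]
    weights = (0.5, 0.5)
    # With no corruption and identical weights, the student matches the teacher exactly.
    print(pseudo_supervision_loss(history, clean_frame, weights, weights, 0.0, rng))
    # With every current-frame feature dropped, the student must rely on history alone.
    print(pseudo_supervision_loss(history, clean_frame, weights, weights, 1.0, rng))
```

In this toy setting the loss is zero when nothing is corrupted and grows as more of the current frame is hidden, which is the gap the student would be trained to close by making better use of the shared history.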
