GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

Source: arXiv. Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). This research focuses on learning model adaptation for adverse and dynamic environments, as well as fine-grained occlusion perception for tracking.

The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.
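
The teacher-student objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The linear predictor, the random feature-dropping corruption, the mean-squared-error loss, the fixed teacher weights, and all names (`predict_model`, `corrupt`, `student_step`) are assumptions made for illustration, since the abstract does not specify any of them. The sketch only shows the data flow: both predictors receive the same history, the teacher sees the clean current frame, the student sees a corrupted one, and the student is updated to match the teacher's output.

```python
import random

def predict_model(weights, history, frame):
    """Hypothetical predictor: one linear layer mapping the concatenated
    (history, current-frame) features to a tracking-model vector."""
    x = history + frame
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def corrupt(frame, drop_prob, rng):
    """Zero out random feature entries to mimic occlusion.
    Assumption: the abstract does not say how frames are corrupted."""
    return [0.0 if rng.random() < drop_prob else v for v in frame]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def student_step(student_w, teacher_w, history, frame, rng,
                 lr=0.05, drop_prob=0.3):
    """One student update; returns the loss measured before the update."""
    # Teacher: clean frame -> pseudo-tracking model, treated as a constant target.
    target = predict_model(teacher_w, history, frame)
    # Student: same history, corrupted frame.
    corrupted = corrupt(frame, drop_prob, rng)
    pred = predict_model(student_w, history, corrupted)
    x = history + corrupted
    n = len(pred)
    # Gradient of the MSE w.r.t. weight (i, j) is 2/n * (pred_i - target_i) * x_j.
    for i, row in enumerate(student_w):
        g = 2.0 * (pred[i] - target[i]) / n
        for j in range(len(row)):
            row[j] -= lr * g * x[j]
    return mse(pred, target)

if __name__ == "__main__":
    rng = random.Random(0)
    history = [0.5, 0.1, 0.9, 0.3]   # toy features from past frames
    frame = [0.2, 0.8, 0.4, 0.6]     # toy features of the current frame
    teacher_w = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]
    student_w = [[0.0] * 8 for _ in range(3)]
    for step in range(5):
        loss = student_step(student_w, teacher_w, history, frame, rng)
        print(step, round(loss, 4))
```

Because the teacher's output comes from the clean frame, the target stays the same no matter which features the corruption removes. That is the sense in which the abstract calls the pseudo supervision stable.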
