Current state-of-the-art (SOTA) methods in visual object tracking often require extensive computational resources and vast amounts of training data, and they remain at risk of overfitting. This study introduces a more efficient training strategy that mitigates overfitting and reduces computational requirements. We balance training with a mix of negative and positive samples from the outset, a strategy we call Joint learning with Negative samples (JN). In a negative sample, the object from the template is not present in the search region; training on such samples prevents the model from simply memorizing the target and instead encourages it to use the template to locate the object. To handle negative samples effectively, we adopt a distribution-based head, which models the bounding box as a distribution of distances and can therefore express uncertainty about the target's location when negative samples are present, offering an efficient way to manage mixed-sample training. Furthermore, our approach introduces a target-indicating token that encapsulates the target's precise location within the template image. This token provides exact boundary details and improves performance at negligible computational cost. Our model, JN-256, exhibits superior performance on challenging benchmarks, achieving 75.8% AO on GOT-10k and 84.1% AUC on TrackingNet. Notably, JN-256 outperforms previous SOTA trackers that use larger models and higher input resolutions, even though it is trained with only half the number of data samples used in those works.
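As an illustration of how a distribution-based head can express uncertainty, the following minimal sketch decodes a box from per-side distance distributions. It is not the JN implementation: the discretization into evenly spaced bins, the use of normalized entropy as an uncertainty score, and the names `decode_side`, `decode_box`, and `max_distance` are all assumptions made for this example, loosely in the style of discretized box regression heads.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_side(logits, max_distance):
    """Decode one box side from bin logits (hypothetical helper).

    Bin i is assumed to represent the distance i * max_distance / (n - 1).
    Returns (expected distance, entropy normalized to [0, 1]).
    """
    probs = softmax(logits)
    n = len(probs)
    step = max_distance / (n - 1)
    expected = sum(p * i * step for i, p in enumerate(probs))
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return expected, entropy / math.log(n)


def decode_box(center, side_logits, max_distance):
    """Decode (x1, y1, x2, y2) and a mean uncertainty score.

    `side_logits` is a dict with keys "left", "top", "right", "bottom",
    each holding the bin logits for the distance from `center` to that side.
    """
    cx, cy = center
    sides = {k: decode_side(v, max_distance) for k, v in side_logits.items()}
    box = (
        cx - sides["left"][0],
        cy - sides["top"][0],
        cx + sides["right"][0],
        cy + sides["bottom"][0],
    )
    uncertainty = sum(s[1] for s in sides.values()) / len(sides)
    return box, uncertainty


if __name__ == "__main__":
    peaked = [0.0, 0.0, 20.0, 0.0, 0.0]  # confident: mass on one bin
    flat = [0.0, 0.0, 0.0, 0.0, 0.0]     # uninformative: uniform mass
    keys = ("left", "top", "right", "bottom")
    print(decode_box((10.0, 10.0), {k: peaked for k in keys}, 8.0))
    print(decode_box((10.0, 10.0), {k: flat for k in keys}, 8.0))
```

Under these assumptions, a sharply peaked distribution yields a low uncertainty score, while a flat distribution yields a score near 1. That is one plausible way for a head to signal that the target may be absent from the search region, as in a negative sample.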