Online contextual reasoning and association across consecutive video frames are critical for perceiving instances in visual tracking. However, most current top-performing trackers still rely on sparse temporal relationships between reference and search frames in an offline mode. Consequently, they can only interact independently within each image pair and establish limited temporal correlations. To alleviate this problem, we propose a simple, flexible, and effective video-level tracking pipeline, named \textbf{ODTrack}, which densely associates the contextual relationships of video frames through online token propagation. ODTrack receives video frames of arbitrary length to capture the spatio-temporal trajectory relationships of an instance, and compresses the discriminative features (localization information) of a target into a token sequence to achieve frame-to-frame association. This solution brings the following benefits: 1) the purified token sequences serve as prompts for inference on the next video frame, so that past information guides future inference; 2) the iterative propagation of token sequences avoids complex online update strategies, yielding more efficient model representation and computation. ODTrack achieves new \textit{SOTA} performance on seven benchmarks while running at real-time speed. Code and models are available at \url{https://github.com/GXNU-ZhongLab/ODTrack}.
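The frame-to-frame token propagation described in the abstract can be illustrated with a toy sketch. This is not the ODTrack architecture or its code: the names `propagate_token` and `track`, the single-vector token, and the plain dot-product attention are all assumptions chosen for illustration. The sketch only shows the control flow, where a token summarizing the target is updated on each frame and passed on as a prompt for the next one, with no separate online update step.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def propagate_token(token: Vector, frame_feats: Sequence[Vector]) -> Tuple[Vector, int]:
    """One online step (hypothetical): the token attends over the patch
    features of the current frame.

    Returns the updated token (attention-weighted average of the patch
    features) and the index of the most-attended patch, which stands in
    for the localization output of a real tracker.
    """
    scale = math.sqrt(len(token))
    weights = softmax([dot(token, f) / scale for f in frame_feats])
    new_token = [
        sum(w * f[d] for w, f in zip(weights, frame_feats))
        for d in range(len(token))
    ]
    best = max(range(len(weights)), key=weights.__getitem__)
    return new_token, best


def track(init_token: Vector, video: Sequence[Sequence[Vector]]) -> Tuple[List[int], Vector]:
    """Process frames one by one, carrying the token forward as a prompt."""
    token = list(init_token)
    path: List[int] = []
    for frame_feats in video:
        token, best = propagate_token(token, frame_feats)
        path.append(best)
    return path, token
```

Calling `track(init_token, video)` with a list of frames, each given as a list of patch feature vectors, returns one selected patch index per frame together with the final token. In this toy version the only state carried between frames is the token itself, which mirrors the propagation idea in the abstract but nothing more.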