Existing visual object tracking (VOT) methods take only the target area in the first frame as a template. This causes tracking to inevitably fail in fast-changing and crowded scenes, as a single fixed template cannot account for changes in object appearance between frames. To address this, we revamp the tracking framework with the Progressive Context Encoding Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories. Specifically, ProContEXT leverages a context-aware self-attention module to encode the spatial and temporal context, refining and updating the multi-scale static and dynamic templates to progressively perform accurate tracking. It explores the complementarity between spatial and temporal context, opening a new pathway to multi-context modeling for transformer-based trackers. In addition, ProContEXT revises the token pruning technique to reduce computational complexity. Extensive experiments on popular benchmark datasets such as GOT-10k and TrackingNet demonstrate that the proposed ProContEXT achieves state-of-the-art performance.
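The abstract does not spell out how token pruning is carried out, so the following is only a minimal sketch of one common form of the idea, not the procedure used in ProContEXT. It assumes that each search-region token is scored by the attention it receives from the template tokens (static and dynamic templates concatenated) and that only the top-scoring fraction is kept. The function name `prune_search_tokens`, the `keep_ratio` parameter, and the toy two-dimensional tokens are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def prune_search_tokens(template_tokens, search_tokens, keep_ratio=0.5):
    """Return sorted indices of the search tokens to keep.

    Hypothetical helper (an assumption, not the paper's algorithm): each
    search token is scored by the scaled dot-product attention it receives,
    averaged over all template tokens, and the top fraction is retained.
    """
    scale = 1.0 / math.sqrt(len(search_tokens[0]))
    received = [0.0] * len(search_tokens)
    for query in template_tokens:
        weights = softmax([dot(query, key) * scale for key in search_tokens])
        for i, w in enumerate(weights):
            received[i] += w / len(template_tokens)
    # Always keep at least one token so later layers have something to attend to.
    n_keep = max(1, math.ceil(keep_ratio * len(search_tokens)))
    ranked = sorted(range(len(search_tokens)),
                    key=lambda i: received[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy example: one static and one dynamic template token, four search tokens.
static_template = [[1.0, 0.0]]
dynamic_template = [[0.9, 0.1]]
search = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [-1.0, 0.0]]
kept = prune_search_tokens(static_template + dynamic_template, search)
print(kept)  # [0, 2]
```

In this toy setup the two search tokens most similar to the templates survive, and the attention layers that follow would process half as many search tokens. In a real tracker the scores would more likely be read from the attention maps already computed inside the transformer rather than recomputed separately.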