Locating an object in a sequence of frames, given its appearance in the first frame, is a hard problem that involves many stages. State-of-the-art methods usually focus on introducing novel ideas into the visual encoding or relational modelling stages. In this work, however, we show that bounding box regression from the learned joint search and template features is just as important. While previous methods relied heavily on well-learned features representing the interactions between search and template, we hypothesize that the receptive field of the input convolutional bounding box network plays an important role in accurately determining the object location. To this end, we introduce two novel bounding box regression networks: inception and deformable. Experiments and ablation studies show that our inception module, when installed on the recent ODTrack, outperforms the original ODTrack on three benchmarks: GOT-10k, UAV123, and OTB2015.
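To make the receptive-field hypothesis concrete, the sketch below computes the one-dimensional receptive field of a stack of convolutions using the standard recurrence (each layer adds `(kernel - 1) * dilation * jump`, where `jump` is the product of the preceding strides). The `Conv` class, the `plain_head` stack, and the branch layout in `inception_branches` are hypothetical illustrations chosen for this example; they are not the architecture proposed in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conv:
    """Hypothetical description of one convolution along a single axis."""
    kernel: int
    stride: int = 1
    dilation: int = 1


def receptive_field(layers):
    """Receptive field, in input pixels along one axis, of stacked convolutions."""
    rf, jump = 1, 1
    for layer in layers:
        # Each layer widens the field by (kernel - 1) dilated steps,
        # scaled by the cumulative stride of the layers before it.
        rf += (layer.kernel - 1) * layer.dilation * jump
        jump *= layer.stride
    return rf


def branch_receptive_fields(branches):
    """Receptive field of every parallel branch, keyed by branch name."""
    return {name: receptive_field(layers) for name, layers in branches.items()}


# Assumed baseline: a plain head of three stacked 3x3 convolutions.
plain_head = [Conv(3), Conv(3), Conv(3)]

# Assumed inception-style layout: parallel branches with different kernels.
inception_branches = {
    "1x1": [Conv(1)],
    "3x3": [Conv(1), Conv(3)],
    "5x5": [Conv(1), Conv(5)],
    "dilated 3x3": [Conv(1), Conv(3, dilation=2)],
}

if __name__ == "__main__":
    print("plain head:", receptive_field(plain_head))
    for name, rf in branch_receptive_fields(inception_branches).items():
        print(f"{name}: {rf}")
```

Under these assumptions, a plain head sees a single fixed window, whereas parallel branches expose several window sizes to the regression layers at once. Deformable convolutions, by contrast, learn their sampling offsets, so their effective receptive field cannot be read off a fixed formula like this one.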