Most model-free visual object tracking methods formulate tracking as estimating the object location in each video frame, given as a 2D segmentation or a bounding box. We argue that this representation is limited and instead propose to guide and improve 2D tracking with an explicit object representation, namely the textured 3D shape and 6DoF pose in each video frame. This representation tackles a complex long-term dense correspondence problem between all 3D points on the object across all video frames, including frames in which some points are invisible. To achieve this, the estimation is driven by re-rendering the input video frames as closely as possible through differentiable rendering, which has not been used for tracking before. The proposed optimization minimizes a novel loss function to estimate the best 3D shape, texture, and 6DoF pose. We improve the state of the art in 2D segmentation tracking on three different datasets with mostly rigid objects.
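The idea of estimating object parameters by re-rendering a frame can be illustrated with a deliberately small sketch. It is not the proposed method: as assumptions for illustration only, a 2D Gaussian blob stands in for the textured 3D object, a 2D position stands in for the 6DoF pose, a plain sum-of-squares image difference stands in for the novel loss, and the gradient is written out by hand instead of coming from a differentiable rendering library. All function names (`render`, `loss_and_grad`, `fit_pose`) are hypothetical.

```python
import numpy as np

def render(pose, size=64, sigma=6.0):
    """Toy differentiable 'renderer': a soft Gaussian blob centred at pose = (x, y)."""
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    return np.exp(-((xs - pose[0]) ** 2 + (ys - pose[1]) ** 2) / (2 * sigma ** 2))

def loss_and_grad(pose, frame, size=64, sigma=6.0):
    """Sum-of-squares re-rendering loss and its analytic gradient w.r.t. the pose."""
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    image = render(pose, size, sigma)
    residual = image - frame
    loss = np.sum(residual ** 2)
    # Derivative of the rendered image with respect to each pose coordinate.
    d_image_dx = image * (xs - pose[0]) / sigma ** 2
    d_image_dy = image * (ys - pose[1]) / sigma ** 2
    grad = np.array([
        np.sum(2 * residual * d_image_dx),
        np.sum(2 * residual * d_image_dy),
    ])
    return loss, grad

def fit_pose(frame, init_pose, steps=200, lr=0.2):
    """Plain gradient descent on the re-rendering loss (step size tuned for this toy)."""
    pose = np.array(init_pose, dtype=float)
    for _ in range(steps):
        _, grad = loss_and_grad(pose, frame)
        pose -= lr * grad
    return pose

# Synthetic 'observed frame' generated from a known pose, then recovered from a rough guess.
true_pose = np.array([40.0, 24.0])
frame = render(true_pose)
estimate = fit_pose(frame, init_pose=(32.0, 30.0))
print(estimate)
```

The sketch only converges because the initial guess overlaps the observed blob, so the image difference carries a usable gradient. A full analysis-by-synthesis tracker would additionally optimize shape and texture and handle occlusion, none of which is modelled here.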