In this paper, we propose a simple and strong framework for Tracking Any Point with TRansformers (TAPTR). Based on the observation that point tracking bears a great resemblance to object detection and tracking, we borrow designs from DETR-like algorithms to address the task of TAP. In the proposed framework, each tracking point in each video frame is represented as a point query, which consists of a positional part and a content part. As in DETR, each query (its position and content feature) is naturally updated layer by layer, and its visibility is predicted from its updated content feature. Queries belonging to the same tracking point exchange information through self-attention along the temporal dimension. As all such operations are well-designed in DETR-like algorithms, the model is conceptually very simple. We also adopt useful designs such as cost volume from optical flow models and develop simple designs to provide long temporal information while mitigating the feature drifting issue. Our framework achieves state-of-the-art performance on various TAP datasets with faster inference speed.
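The query mechanism described above can be illustrated with a minimal numpy sketch. This is not the paper's actual implementation: the weights are random stand-ins, learned projections and multi-head attention are omitted, and all names are illustrative. It only shows the data flow of one decoder layer for a single tracking point: temporal self-attention over the per-frame content features, then a position refinement and a visibility prediction from the updated content.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, D = 8, 16                             # number of frames, feature dim

# One point query per frame for a single tracking point:
# a positional part (x, y) and a content part (D-dim feature).
positions = rng.random((T, 2))
contents = rng.standard_normal((T, D))

# Temporal self-attention: queries of the same tracking point
# exchange information across frames (single head, no projections).
attn = softmax(contents @ contents.T / np.sqrt(D), axis=-1)
contents = attn @ contents

# DETR-style layer update with stand-in (random) heads: predict a
# position delta and a visibility logit from the updated content.
W_delta = rng.standard_normal((D, 2)) * 0.01
W_vis = rng.standard_normal((D, 1))
positions = positions + contents @ W_delta                # refine (x, y)
visibility = 1.0 / (1.0 + np.exp(-(contents @ W_vis)))    # sigmoid
```

In the full model this update is stacked over several decoder layers, with the refined positions of one layer serving as the query positions of the next.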