Existing referring understanding tasks tend to involve the detection of a single text-referred object. In this paper, we propose a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking. To the best of our knowledge, this is the first work to predict an arbitrary number of referent objects in videos. To push RMOT forward, we construct a benchmark with scalable expressions based on KITTI, named Refer-KITTI. Specifically, it provides 18 videos with 818 expressions, and each expression in a video is annotated with an average of 10.7 objects. Further, we develop a transformer-based architecture, TransRMOT, to tackle the new task in an online manner; it achieves impressive detection performance and outperforms other counterparts.