Referring Multi-Object Tracking (RMOT) is an important topic in the tracking field. The task is to guide a tracker to follow the objects that match a language description. Current research focuses mainly on the single-view setting, in which the input is one view sequence or multiple unrelated view sequences. In a single view, however, parts of an object's appearance are often not visible, which leads to objects being incorrectly matched with the language description. In this work, we propose a new task, Cross-view Referring Multi-Object Tracking (CRMOT). It introduces cross-view input to capture object appearance from multiple views, avoiding the invisible-appearance problem of the RMOT task. CRMOT is more challenging: it requires accurately tracking the objects that match the language description while maintaining the identity consistency of objects across views. To advance the CRMOT task, we construct a cross-view referring multi-object tracking benchmark, named CRTrack, based on the CAMPUS and DIVOTrack datasets. It provides 13 different scenes and 221 language descriptions. Furthermore, we propose an end-to-end cross-view referring multi-object tracking method, named CRTracker. Extensive experiments on the CRTrack benchmark verify the effectiveness of our method. The dataset and code are available at https://github.com/chen-si-jia/CRMOT.