A new trend in multi-object tracking is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders progress. To address this challenge, we propose a high-quality yet low-cost data generation method based on Unreal Engine 5 and construct a new benchmark dataset, Refer-UE-City, which consists primarily of intersection surveillance scenes and describes the appearance and actions of people and vehicles in detail. Specifically, it provides 14 videos with a total of 714 expressions, making it comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object tracking framework called MLS-Track, in which the interaction between the model and the text is enhanced layer by layer through a Semantic Guidance Module (SGM) and a Semantic Correlation Branch (SCB). Extensive experiments on the Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of the proposed framework, which achieves state-of-the-art performance. Code and datasets will be made available.