Global multi-object tracking (MOT) systems can account for interaction, occlusion, and other ``visual blur'' scenarios to track objects effectively in long videos. Among them, graph-based tracking-by-detection paradigms achieve surprisingly strong performance. However, their fully connected graphs impose storage requirements that make long videos difficult to handle. Commonly used methods still generate trajectories by building one-forward associations across frames. Such matches, produced under the guidance of first-order similarity information, may not be optimal over a longer time horizon. Moreover, these methods often lack an end-to-end scheme for correcting mismatches. This paper proposes the Composite Node Message Passing Network (CoNo-Link), a framework that generalizes across scenes and models information over ultra-long frame sequences for association. CoNo-Link builds constrained connected graphs with low storage overhead. In addition to treating objects as nodes, as previous methods do, the network also treats object trajectories as nodes for information interaction, which improves the feature representation capability of the graph neural network. Specifically, we formulate the graph-building problem as a top-k selection task over reliable objects or trajectories. By adding composite nodes, our model can learn better predictions over longer time scales. As a result, our method outperforms the state of the art on several commonly used datasets.
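To make the top-k formulation concrete, the following is a minimal sketch of building a constrained graph instead of a fully connected one. It is not the CoNo-Link implementation: the function name `build_topk_edges`, the parameters `k` and `max_gap`, and the use of cosine similarity as the selection score are all assumptions chosen for illustration, whereas the abstract only states that graph building is cast as a top-k selection. Each node may stand for a single detection or for a whole trajectory, as long as it comes with an embedding and a frame index.

```python
import numpy as np


def build_topk_edges(features, frame_ids, k=2, max_gap=50):
    """Hypothetical helper: link each node to its k most similar later nodes.

    features  : (N, D) array of node embeddings (detections or trajectories)
    frame_ids : (N,) array with the frame index of each node
    k         : maximum number of outgoing edges kept per node (assumed)
    max_gap   : largest frame distance considered for an edge (assumed)
    Returns a sorted list of directed edges (i, j) with frame_ids[i] < frame_ids[j].
    """
    features = np.asarray(features, dtype=float)
    frame_ids = np.asarray(frame_ids)

    # Normalize once so that a dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)

    edges = []
    for i in range(len(frame_ids)):
        gap = frame_ids - frame_ids[i]
        # Candidates are nodes in strictly later frames within the window.
        candidates = np.flatnonzero((gap > 0) & (gap <= max_gap))
        if candidates.size == 0:
            continue
        # Score one row at a time, so no dense N x N matrix is ever stored.
        scores = unit[candidates] @ unit[i]
        # Keep only the k best candidates rather than every possible edge.
        best = np.argsort(-scores, kind="stable")[:k]
        edges.extend((i, int(j)) for j in candidates[best])
    return sorted(edges)


# Toy input: five nodes over three frames, two appearance clusters.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0], [1.0, 0.0]]
toy_frames = [0, 0, 1, 1, 2]
toy_edges = build_topk_edges(toy_features, toy_frames, k=1, max_gap=1)
print(toy_edges)
```

With at most `k` outgoing edges per node, the edge count grows linearly with the number of nodes rather than quadratically, which is the storage argument made above. A learned scoring function could replace the cosine similarity without changing the selection step.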