Diffusion models have gained prominence in generating data for perception tasks such as image classification and object detection. However, their potential for generating high-quality tracking sequences, a crucial aspect of video perception, has not been fully investigated. To address this gap, we propose TrackDiffusion, a novel architecture designed to generate continuous video sequences from tracklets. Unlike traditional layout-to-image (L2I) generation and copy-paste synthesis, which focus on static image elements such as bounding boxes, TrackDiffusion empowers image diffusion models to encompass dynamic and continuous tracking trajectories, thereby capturing complex motion nuances and ensuring instance consistency across video frames. For the first time, we demonstrate that the generated video sequences can be used to train multi-object tracking (MOT) systems, leading to significant improvement in tracker performance. Experimental results show that our model significantly enhances instance consistency in generated video sequences, leading to improved perceptual metrics. Our approach achieves an improvement of 8.7 in TrackAP and 11.8 in TrackAP$_{50}$ on the YTVIS dataset, underscoring its potential to redefine the standards of video data generation for MOT tasks and beyond.