TrackDiffusion: Multi-object Tracking Data Generation via Diffusion Models

Diffusion models have gained prominence in generating data for perception tasks such as image classification and object detection. However, their potential for generating high-quality tracking sequences, a crucial aspect of video perception, has not been fully investigated. To address this gap, we propose TrackDiffusion, a novel architecture designed to generate continuous video sequences from tracklets. TrackDiffusion departs from traditional layout-to-image (L2I) generation and copy-paste synthesis, which focus on static image elements such as bounding boxes: it enables image diffusion models to handle dynamic, continuous tracking trajectories, thereby capturing complex motion nuances and ensuring instance consistency across video frames. For the first time, we demonstrate that the generated video sequences can be used to train multi-object tracking (MOT) systems, leading to significant improvement in tracker performance. Experimental results show that our model significantly enhances instance consistency in generated video sequences, leading to improved perceptual metrics. Our approach achieves an improvement of 8.7 in TrackAP and 11.8 in TrackAP$_{50}$ on the YTVIS dataset, underscoring its potential to redefine the standards of video data generation for MOT tasks and beyond.
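To make the contrast with per-image layouts concrete, the following is a minimal sketch of how a set of tracklets might be represented and unrolled into per-frame layouts. It is not taken from the TrackDiffusion code. The names `Tracklet` and `per_frame_layouts`, the normalized box format, and the example categories are all assumptions made for illustration. The point it shows is that, unlike an L2I layout, each box carries an instance identity that persists across frames.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class Tracklet:
    """One object instance followed over time (illustrative structure only)."""
    instance_id: int
    category: str
    boxes: Dict[int, Box]  # frame index -> box; absent frames mean "not visible"


def per_frame_layouts(
    tracklets: List[Tracklet], num_frames: int
) -> List[List[Tuple[int, str, Box]]]:
    """Unroll tracklets into one layout per frame, keeping instance ids attached."""
    layouts: List[List[Tuple[int, str, Box]]] = [[] for _ in range(num_frames)]
    for tracklet in tracklets:
        for frame, box in sorted(tracklet.boxes.items()):
            if 0 <= frame < num_frames:  # ignore boxes outside the clip
                layouts[frame].append((tracklet.instance_id, tracklet.category, box))
    return layouts


# Example: a person visible in frames 0 and 1, a dog that appears in frame 1.
tracklets = [
    Tracklet(1, "person", {0: (0.10, 0.10, 0.30, 0.50), 1: (0.15, 0.10, 0.35, 0.50)}),
    Tracklet(2, "dog", {1: (0.60, 0.55, 0.80, 0.90)}),
]
layouts = per_frame_layouts(tracklets, num_frames=3)
```

In this sketch, a generator conditioned on `layouts` would see the same `instance_id` for the person in frames 0 and 1, which is the information a static per-image layout does not provide.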

