TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models

Despite remarkable achievements in video synthesis, achieving granular control over complex dynamics, such as nuanced movement among multiple interacting objects, still presents a significant hurdle for dynamic world modeling, compounded by the necessity to manage appearance and disappearance, drastic scale changes, and ensure consistency for instances across frames. These challenges hinder the development of video generation that can faithfully mimic real-world complexity, limiting utility for applications requiring high-level realism and controllability, including advanced scene simulation and training of perception systems. To address that, we propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control via diffusion models, which facilitates the precise manipulation of the object trajectories and interactions, overcoming the prevalent limitation of scale and continuity disruptions. A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects, a critical factor overlooked in the current literature. Moreover, we demonstrate that generated video sequences by our TrackDiffusion can be used as training data for visual perception models. To the best of our knowledge, this is the first work to apply video diffusion models with tracklet conditions and demonstrate that generated frames can be beneficial for improving the performance of object trackers.

翻译：尽管视频合成领域取得了显著成就，但在动态世界建模中实现对复杂动态（如多个交互对象间的细微运动）的细粒度控制仍是一个重大挑战。这需要应对对象的出现与消失、剧烈的尺度变化，并确保跨帧实例的一致性。这些挑战阻碍了能够忠实模拟真实世界复杂性的视频生成技术的发展，限制了其在需要高度真实感与可控性的应用（如高级场景模拟和感知系统训练）中的效用。为此，我们提出TrackDiffusion——一种新颖的视频生成框架，通过扩散模型实现细粒度的轨迹条件运动控制，可精确操控对象轨迹与交互，克服了尺度与连续性中断的常见局限。该框架的核心组件是实例增强器（instance enhancer），其明确保证了多个对象的跨帧一致性——这一关键因素在现有文献中常被忽视。此外，我们证明TrackDiffusion生成的视频序列可直接用作视觉感知模型的训练数据。据我们所知，这是首个应用带有轨迹条件的视频扩散模型，并证明生成帧可用于提升目标跟踪器性能的研究工作。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日