Despite remarkable progress in video synthesis, fine-grained control over complex dynamics, such as the nuanced motion of multiple interacting objects, remains a significant hurdle for dynamic world modeling. The difficulty is compounded by the need to handle objects appearing and disappearing, drastic scale changes, and instance consistency across frames. These challenges hinder the development of video generation that faithfully mimics real-world complexity, limiting its utility for applications that require high realism and controllability, including advanced scene simulation and the training of perception systems. To address this, we propose TrackDiffusion, a novel video generation framework that provides fine-grained, trajectory-conditioned motion control via diffusion models. It enables precise manipulation of object trajectories and interactions, overcoming the prevalent limitations of scale and continuity disruptions. A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects, a critical factor overlooked in the current literature. Moreover, we demonstrate that video sequences generated by TrackDiffusion can be used as training data for visual perception models. To the best of our knowledge, this is the first work to apply video diffusion models with tracklet conditions and to demonstrate that generated frames can improve the performance of object trackers.