Recent years have seen substantial progress in diffusion-based controllable video generation. However, achieving precise control in complex scenarios, including fine-grained object parts, sophisticated motion trajectories, and coherent background movement, remains a challenge. In this paper, we introduce TrackGo, a novel approach that leverages free-form masks and arrows for conditional video generation, offering users a flexible and precise mechanism for manipulating video content. We also propose the TrackAdapter, an efficient and lightweight adapter for implementing this control, designed to be seamlessly integrated into the temporal self-attention layers of a pretrained video generation model. This design leverages our observation that the attention maps of these layers can accurately activate the regions corresponding to motion in videos. Our experimental results demonstrate that our approach, enhanced by the TrackAdapter, achieves state-of-the-art performance on key metrics, including FVD, FID, and ObjMC. The project page of TrackGo can be found at: https://zhtjtcz.github.io/TrackGo-Page/
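To make the adapter idea concrete, the following is a minimal, hypothetical PyTorch sketch of a lightweight adapter branch attached in parallel to a frozen temporal self-attention layer. The abstract does not specify the TrackAdapter's internals; the module structure, shapes, and the way the motion condition is injected here are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TemporalSelfAttentionWithAdapter(nn.Module):
    """Hypothetical sketch: a frozen temporal self-attention layer with a
    lightweight, trainable adapter branch added in parallel. Names, shapes,
    and the conditioning scheme are assumptions for illustration only."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Pretrained temporal self-attention, kept frozen while the adapter trains.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        for p in self.attn.parameters():
            p.requires_grad = False
        # Adapter branch: its own attention plus a zero-initialized output
        # projection, so training starts from the pretrained model's behavior.
        self.adapter_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.adapter_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.adapter_out.weight)
        nn.init.zeros_(self.adapter_out.bias)

    def forward(self, x: torch.Tensor, motion_cond: torch.Tensor) -> torch.Tensor:
        # x: (batch * spatial_tokens, frames, dim) -- attention runs along time.
        base, _ = self.attn(x, x, x)
        # Adapter attends with the motion condition (e.g., encoded masks/arrows)
        # injected into keys and values; this detail is an assumption.
        adapted, _ = self.adapter_attn(x, x + motion_cond, x + motion_cond)
        return base + self.adapter_out(adapted)

# Usage with made-up dimensions: 2 videos, 64 spatial tokens, 16 frames.
layer = TemporalSelfAttentionWithAdapter(dim=320)
x = torch.randn(2 * 64, 16, 320)
cond = torch.randn_like(x)   # stand-in for encoded mask/arrow trajectory features
y = layer(x, cond)           # same shape as x
```

The zero-initialized output projection is a common adapter trick (e.g., in ControlNet-style designs): at the start of training the branch contributes nothing, so the pretrained model's generations are preserved while the adapter gradually learns the control signal.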