This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics, such as non-rigid subject motion and complex camera movements, that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of complex instruction-based manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.
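The pipeline's first stage, sampling frame pairs from a video, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the gap bounds, the representation of a video as an indexable sequence of frames, and the `query_mllm` helper are all assumptions introduced here for clarity.

```python
import random

def sample_frame_pairs(num_frames, num_pairs, min_gap=8, max_gap=48, seed=0):
    """Sample (source_idx, target_idx) pairs with a bounded temporal gap.

    A bounded gap keeps subject and scene identity consistent between the
    two frames while still capturing natural motion, which is the property
    the pipeline relies on for content-preserving edit pairs.
    """
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < num_pairs:
        gap = rng.randint(min_gap, max_gap)          # temporal distance
        src = rng.randint(0, num_frames - 1 - gap)   # valid source index
        pairs.append((src, src + gap))
    return pairs

def query_mllm(source_idx, target_idx):
    # Hypothetical placeholder: the real pipeline would prompt a multimodal
    # LLM with both frames and ask it to describe the edit that transforms
    # the source frame into the target frame.
    return f"instruction describing the change from frame {source_idx} to {target_idx}"

# Build a toy dataset of (source, target, instruction) records.
pairs = sample_frame_pairs(num_frames=300, num_pairs=3)
dataset = [
    {"source": s, "target": t, "instruction": query_mllm(s, t)}
    for s, t in pairs
]
```

In practice the second stage replaces `query_mllm` with an actual MLLM call that receives the two decoded frames, and the resulting (source image, instruction, target image) triplets become the supervision signal for the editing model.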