Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, which limits their effectiveness when multiple subjects must perform the desired motions together. To tackle this challenge, we propose VideoMage, a unified framework for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, together with an appearance-agnostic motion learning approach that disentangles motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.
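To make the LoRA-based customization concrete, below is a minimal sketch (not the authors' code) of the general mechanism the abstract relies on: low-rank adapters injected alongside a frozen linear projection, with separately named "subject" and "motion" adapters that could be trained on reference images and videos respectively and then activated together at inference. The rank, scale, and adapter names are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of per-concept LoRA adapters on a frozen layer.
# Not VideoMage's implementation; an illustration of the general idea.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer plus named low-rank residual adapters."""

    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        self.rank, self.scale = rank, scale
        self.adapters = nn.ModuleDict()  # e.g. {"subject": ..., "motion": ...}

    def add_adapter(self, name: str):
        down = nn.Linear(self.base.in_features, self.rank, bias=False)
        up = nn.Linear(self.rank, self.base.out_features, bias=False)
        nn.init.zeros_(up.weight)  # adapter starts as a zero perturbation
        self.adapters[name] = nn.Sequential(down, up)

    def forward(self, x, active=("subject", "motion")):
        out = self.base(x)
        for name in active:
            if name in self.adapters:
                out = out + self.scale * self.adapters[name](x)
        return out


# Usage: wrap an attention projection of a (hypothetical) T2V backbone,
# train the "subject" adapter on reference images and the "motion"
# adapter on a reference video, then activate both when sampling.
layer = LoRALinear(nn.Linear(320, 320), rank=4)
layer.add_adapter("subject")
layer.add_adapter("motion")
y = layer(torch.randn(2, 320))  # combined subject + motion adaptation
```

Keeping the two adapters in separate parameter sets is what makes a composition scheme like the one described above possible: each concept can be learned in isolation and the contributions combined only at generation time.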