Controllable character animation requires transferring motion from a driving sequence to a reference character. Prior work relies heavily on intermediate representations, such as pose skeletons to represent motion or masked backgrounds to represent the environment, which inevitably leads to information loss. To address this, we present SCAIL-2, a framework that bypasses these intermediates and achieves \textbf{end-to-end} character animation. By directly concatenating the driving video to the input sequence, the model can obtain all the required visual information from the video itself. To address the lack of end-to-end data, we unify the sub-tasks of character animation under decoupled conditions and build a pipeline to synthesize MotionPair-60K, an end-to-end motion transfer dataset covering heterogeneous character animation tasks. To achieve this unification, we use in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information. To address synthesis discrepancies in finely detailed regions, we propose Bias-Aware DPO, which constructs preference items to mitigate these errors. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches across a range of character animation tasks. A large subset of the synthetic data, as well as the model weights, will be released on our project page: https://teal024.github.io/SCAIL-2/.