Achieving character animation that meets studio-grade production standards remains challenging despite recent progress. Existing approaches can transfer motion from a driving video to a reference image, but they often fail to preserve structural fidelity and temporal consistency in in-the-wild scenarios involving complex motion and cross-identity animation. In this work, we present \textbf{SCAIL} (\textbf{S}tudio-grade \textbf{C}haracter \textbf{A}nimation via \textbf{I}n-context \textbf{L}earning), a framework that addresses these challenges through two key innovations. First, we propose a novel 3D pose representation that provides a more robust and flexible motion signal. Second, we introduce a full-context pose injection mechanism within a diffusion-transformer architecture, enabling effective spatio-temporal reasoning over full motion sequences. To align with studio-level requirements, we develop a curated data pipeline that ensures both diversity and quality, and we establish a comprehensive benchmark for systematic evaluation. Experiments show that \textbf{SCAIL} achieves state-of-the-art performance and advances character animation toward studio-grade reliability and realism.