Recent advances in diffusion models have significantly improved conditional video generation, particularly in the pose-guided human image animation task. Although existing methods can generate high-fidelity, temporally consistent animation sequences for regular motions and static scenes, they still show obvious limitations on complex human body motions that are highly dynamic and non-standard, and the field lacks a high-quality benchmark for evaluating complex human motion animation. To address these challenges, we propose a concise yet powerful DiT-based human animation generation baseline and design spatial low-frequency enhanced RoPE, a novel module that selectively enhances low-frequency spatial feature modeling by introducing learnable frequency scaling. Furthermore, we introduce the Open-HyperMotionX Dataset and HyperMotionX Bench, which provide high-quality human pose annotations and curated video clips for evaluating and improving pose-guided human image animation models under complex human motion conditions. Our method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences. Extensive experiments demonstrate the effectiveness of our dataset and proposed approach in improving the generation quality of complex human motion image animation. The code, model weights, and dataset are publicly available at https://vivocameraresearch.github.io/hypermotion/
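To make the idea of learnable frequency scaling in RoPE more concrete, the following is a minimal sketch and not the implementation used in this work. It assumes the standard rotary frequencies theta_i = base^(-2i/d) along a single spatial axis and multiplies only the low-frequency band by a scale vector. The names `rope_frequencies`, `scaled_frequencies`, `apply_rope`, `low_freq_scale`, and `cutoff` are hypothetical, and the scale is a fixed array standing in for a parameter that would be learned during training.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    """Standard RoPE frequencies theta_i = base^(-2i/dim), i = 0..dim/2-1."""
    return base ** (-np.arange(0, dim, 2) / dim)

def scaled_frequencies(freqs, low_freq_scale, cutoff):
    """Scale only the low-frequency band (indices >= cutoff).

    low_freq_scale is a stand-in for a learnable parameter (assumption);
    frequencies below the cutoff index are left unchanged.
    """
    out = freqs.copy()
    out[cutoff:] = out[cutoff:] * low_freq_scale
    return out

def apply_rope(x, positions, freqs):
    """Rotate consecutive feature pairs of x (n, dim) by angle position * freq."""
    angles = positions[:, None] * freqs[None, :]  # shape (n, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Usage: 5 spatial positions, 8-dimensional features, scale the lower half
# of the frequency spectrum by a hypothetical factor of 1.5.
dim, cutoff = 8, 2
freqs = rope_frequencies(dim)
scaled = scaled_frequencies(freqs, np.full(dim // 2 - cutoff, 1.5), cutoff)
x = np.random.default_rng(0).normal(size=(5, dim))
positions = np.arange(5, dtype=float)
y = apply_rope(x, positions, scaled)
```

Because each feature pair is only rotated, the norm of every token is preserved regardless of the scale values; the scaling changes how quickly the low-frequency channels vary with spatial position, which is one plausible reading of "selectively enhancing" low-frequency modeling.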