Human motion analysis tasks, such as temporal 3D pose estimation, motion prediction, and motion in-betweening, play an essential role in computer vision. However, current paradigms suffer from severe fragmentation. First, the field is split between ``perception'' models that understand motion from video but output only text, and ``generation'' models that cannot perceive from raw visual input. Second, generative multimodal large language models (MLLMs) are often limited to single-frame, static poses represented by the dense, parametric SMPL model, and fail to handle temporal motion. Third, existing motion vocabularies are built from skeleton data alone, severing the link to the visual domain. To address these challenges, we introduce Superman, a unified framework that bridges visual perception with temporal, skeleton-based motion generation. Our solution is twofold. First, to overcome the modality disconnect, we propose a Vision-Guided Motion Tokenizer. Exploiting the natural geometric alignment between 3D skeletons and visual data, this module learns jointly and robustly from both modalities, producing a unified, cross-modal motion vocabulary. Second, grounded in this motion language, we train a single, unified MLLM architecture to handle all tasks. This model flexibly processes diverse temporal inputs, unifying 3D skeleton pose estimation from video (perception) with skeleton-based motion prediction and in-betweening (generation). Extensive experiments on standard benchmarks, including Human3.6M, demonstrate that our unified method achieves state-of-the-art or competitive performance across all motion tasks, showcasing a more efficient and scalable path for skeleton-based generative motion analysis.