Human motion synthesis has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior, namely finger articulation, contact timing, and inter-hand coordination. Existing resources also lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration between the hands. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate existing datasets and filter them for quality, and we collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that first extracts representative motion features, e.g., contact events and finger flexion, and then leverages the reasoning of large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, as measured by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.
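To make the feature-extraction half of the decoupled annotation strategy concrete, a minimal sketch follows. It is not the HandX implementation: the function names, the 1 cm contact threshold (assuming positions in meters), and the three-joint definition of flexion are all illustrative assumptions. The sketch computes a flexion angle from three consecutive joint positions and groups frames in which two fingertips are close into contact intervals; features of this kind could then be serialized and passed to a language model for description.

```python
import math

def flexion_angle(base, middle, tip):
    """Bend angle in degrees at `middle`; 0 means the finger segment is straight.

    Assumption: each argument is an (x, y, z) joint position.
    """
    u = [m - b for b, m in zip(base, middle)]   # proximal bone direction
    v = [t - m for m, t in zip(middle, tip)]    # distal bone direction
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0.0:
        return 0.0  # degenerate bone; treat as straight
    cos = sum(a * b for a, b in zip(u, v)) / norm
    cos = max(-1.0, min(1.0, cos))              # guard against rounding error
    return math.degrees(math.acos(cos))

def contact_events(left_tip, right_tip, threshold=0.01):
    """Return inclusive (start_frame, end_frame) intervals of fingertip contact.

    Assumption: contact is approximated by distance < `threshold`
    (hypothetical value of 0.01, i.e. 1 cm if positions are in meters).
    """
    n = min(len(left_tip), len(right_tip))
    events, start = [], None
    for frame in range(n):
        touching = math.dist(left_tip[frame], right_tip[frame]) < threshold
        if touching and start is None:
            start = frame                        # contact begins
        elif not touching and start is not None:
            events.append((start, frame - 1))    # contact ended on previous frame
            start = None
    if start is not None:
        events.append((start, n - 1))            # contact lasts to the final frame
    return events

if __name__ == "__main__":
    left = [(0.0, 0.0, 0.0)] * 5
    right = [(1.0, 0, 0), (0.005, 0, 0), (0.005, 0, 0), (1.0, 0, 0), (0.005, 0, 0)]
    print(contact_events(left, right))
    print(flexion_angle((0, 0, 0), (1, 0, 0), (1, 1, 0)))
```

Keeping the geometric features separate from the language step means the thresholds can be tuned or audited without re-running the language model.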