BiTDiff: Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion

3D conducting motion generation aims to synthesize fine-grained conductor motions from music, with potential applications in music education, virtual performance, digital human animation, and human-AI co-creation. The task remains underexplored because of two challenges: (1) the lack of large-scale, fine-grained 3D conducting datasets, and (2) the absence of methods that support long-sequence generation with both high quality and high efficiency. To address the data limitation, we develop a quality-oriented 3D conducting motion collection pipeline and construct CM-Data, a fine-grained SMPL-X dataset containing about 10 hours of conducting motion. To the best of our knowledge, CM-Data is the first and largest public dataset for 3D conducting motion generation. To address the methodological limitation, we propose BiTDiff, a framework for 3D conducting motion generation that combines a BiMamba-Transformer hybrid architecture for efficient long-sequence modeling with a diffusion-based generative strategy that uses human-kinematic decomposition for high-quality motion synthesis. Specifically, BiTDiff introduces auxiliary physical-consistency losses and a hand- and body-specific forward-kinematics design for finer-grained motion modeling, uses BiMamba for memory-efficient long-sequence temporal modeling, and uses a Transformer for cross-modal semantic alignment. BiTDiff also supports training-free joint-level motion editing, which enables downstream human-AI interaction design. Extensive quantitative and qualitative experiments show that BiTDiff achieves state-of-the-art (SOTA) performance for 3D conducting motion generation on CM-Data. Code will be made available upon acceptance.
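The abstract does not say how training-free joint-level editing is implemented. One common way to obtain such editing from a diffusion model is masked replacement during sampling: at every denoising step, the joints to be preserved are overwritten with the reference motion, and the model fills in the rest. The sketch below illustrates that general idea only and is not BiTDiff's actual procedure. The names `edit_motion`, `denoise_fn`, and `joint_mask`, the linear noise schedule, and the simplified "predict clean motion, then re-noise" sampler are all assumptions made for illustration.

```python
import numpy as np


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal-retention schedule (linear betas; an assumed choice)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_sample(x0, alpha_bar_t, rng):
    """Forward diffusion: noise a clean motion x0 to the given noise level."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise


def edit_motion(denoise_fn, reference, joint_mask, num_steps=50, seed=0):
    """Hypothetical joint-level editing by masked replacement.

    denoise_fn(x_t, t) -> predicted clean motion (stand-in for a trained model).
    reference:  array of shape (frames, joints, dims).
    joint_mask: boolean array of shape (joints,); True = keep from reference.
    """
    rng = np.random.default_rng(seed)
    alpha_bar = make_alpha_bar(num_steps)
    mask = joint_mask[None, :, None]  # broadcast over frames and dims

    x = rng.standard_normal(reference.shape)  # start from pure noise
    for t in range(num_steps - 1, -1, -1):
        x0_pred = denoise_fn(x, t)
        # Overwrite the constrained joints with the reference motion.
        x0_pred = np.where(mask, reference, x0_pred)
        # Re-noise to the next (lower) noise level, or stop at t == 0.
        x = q_sample(x0_pred, alpha_bar[t - 1], rng) if t > 0 else x0_pred
    return x
```

Because the constraint is applied only at sampling time, no retraining is needed under this scheme. In practice `denoise_fn` would be a music-conditioned model, and `joint_mask` could select, for example, the body joints to keep while the hand joints are regenerated.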
