SiGnature: Explicit Motion Diffusion for Stylized Semantic Gesture

While recent advances in co-speech gesture generation have achieved impressive rhythmic synchronization, synthesizing gestures that are both semantically meaningful and faithful to a speaker's unique non-verbal style remains an open challenge. Semantic gestures, such as iconic shapes or deictic pointing, are statistically sparse, making them difficult to learn effectively within standard generative models. We present SiGnature, a framework for Stylized and Semantic Gesture generation that reconciles precise semantic control with high-fidelity style preservation. Unlike prevalent methods that rely on entangled latent representations, SiGnature operates in an explicit joint-rotation space. This design enables our core contribution, Joint Motion Integration (JMI), a training-free inference mechanism capable of injecting any external motion sequence, particularly in-the-wild semantic gestures, directly into the diffusion process. JMI automatically identifies the specific ``active joints'' conveying a semantic action and injects them into the generation, while relying on the diffusion backbone to synthesize the remaining body dynamics, including posture and flow, in accordance with the pre-learned style of the target speaker. This allows for the plug-and-play integration of arbitrary motions, including complex semantic gestures, without retraining or introducing the ``Frankenstein'' artifacts typical of cut-and-paste methods. Extensive experiments and perceptual studies demonstrate that SiGnature offers superior semantic motion control while maintaining smooth and natural co-speech gesture generation and preserving the distinct characteristics of the speaker, thereby outperforming state-of-the-art baselines.
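The two steps attributed to JMI above (selecting the "active joints" of an external motion, then injecting them during sampling while the backbone generates everything else) can be sketched as a toy example. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how active joints are detected or how the injection is scheduled, so the motion-energy threshold, the overwrite-after-each-step rule, and every name below (`find_active_joints`, `toy_denoise_step`, `generate_with_injection`, `ratio`, `sigmas`) are hypothetical. Each joint is also reduced to a single scalar angle for brevity, whereas a real system would use a full rotation representation.

```python
import random


def find_active_joints(motion, ratio=0.5):
    """Return indices of joints that move the most in `motion`.

    `motion` is a list of frames; each frame is a list with one scalar
    angle per joint (a simplification of real joint rotations).
    ASSUMPTION: "active" means the mean frame-to-frame change is at
    least `ratio` times that of the most active joint.
    """
    n_frames, n_joints = len(motion), len(motion[0])
    energy = []
    for j in range(n_joints):
        diffs = [abs(motion[f + 1][j] - motion[f][j]) for f in range(n_frames - 1)]
        energy.append(sum(diffs) / len(diffs))
    peak = max(energy)
    return [j for j, e in enumerate(energy) if peak > 0 and e >= ratio * peak]


def toy_denoise_step(x, sigma):
    """Placeholder for a style-conditioned diffusion backbone.

    ASSUMPTION: it simply shrinks every value toward zero; a trained
    network would predict a cleaner pose in the target speaker's style.
    """
    return [[0.9 * v for v in frame] for frame in x]


def generate_with_injection(reference, active, sigmas,
                            denoise_step=toy_denoise_step, seed=0):
    """Sample a motion while pinning the active joints to `reference`.

    After every denoising step, the active joints are overwritten with
    the reference motion re-noised to the next noise level (an
    inpainting-style constraint). The remaining joints are left to the
    backbone. The noise level after the last step is zero, so the
    active joints end up equal to the reference.
    """
    rng = random.Random(seed)
    n_frames, n_joints = len(reference), len(reference[0])
    # Start from pure noise at the highest noise level.
    x = [[rng.gauss(0.0, sigmas[0]) for _ in range(n_joints)]
         for _ in range(n_frames)]
    for i, sigma in enumerate(sigmas):
        x = denoise_step(x, sigma)
        next_sigma = sigmas[i + 1] if i + 1 < len(sigmas) else 0.0
        for f in range(n_frames):
            for j in active:
                x[f][j] = reference[f][j] + rng.gauss(0.0, next_sigma)
    return x


# Toy reference clip: 4 frames, 3 joints; only joint 1 moves noticeably.
reference = [
    [0.0, 0.0, 0.00],
    [0.0, 1.0, 0.01],
    [0.0, 0.0, 0.02],
    [0.0, 1.0, 0.03],
]
active = find_active_joints(reference)
motion = generate_with_injection(reference, active, sigmas=[1.0, 0.5, 0.1])
```

Because the constraint is applied only at inference time, nothing in this sketch requires retraining the backbone, which is the property the abstract describes as training-free. Operating on explicit per-joint values is also what makes the per-joint overwrite possible; in an entangled latent space there would be no direct handle on individual joints.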
