We present ReCoM, an efficient framework for generating high-fidelity, generalizable human body motions synchronized with speech. The core innovation is the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) backbone to explicitly model co-speech motion dynamics. By jointly modeling spatial and temporal dependencies, RET synthesizes coherent motion and thus more natural, higher-fidelity gestures. The DER strategy further improves robustness, equipping the model with both noise resistance and cross-domain generalization and enhancing the naturalness and fluency of zero-shot motion generation for unseen speech inputs. To mitigate the inherent limitations of autoregressive inference, namely error accumulation and limited self-correction, we propose an iterative reconstruction inference (IRI) strategy that refines motion sequences through cyclic pose reconstruction, driven by two key components: (1) classifier-free guidance, which aligns the distribution of generated gestures with that of real gestures without auxiliary supervision, and (2) a temporal smoothing process, which removes abrupt inter-frame transitions while preserving kinematic continuity. Extensive experiments on benchmark datasets validate ReCoM's effectiveness, achieving state-of-the-art performance across evaluation metrics. Notably, it reduces the Fr\'echet Gesture Distance (FGD) from 18.70 to 2.48, an 86.7% improvement in motion realism. Our project page is https://yong-xie-xy.github.io/ReCoM/.
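As a rough illustration of the IRI idea described above, the PyTorch-style sketch below shows what a single refinement iteration could look like: the current pose estimate is re-predicted with classifier-free guidance and then smoothed along the time axis. This is a minimal sketch under assumed interfaces; the names `iri_refine_step`, `model`, `guidance_scale`, and `smooth_kernel` are hypothetical and are not taken from the paper, and the actual ReCoM implementation may differ substantially.

```python
import torch
import torch.nn.functional as F

def iri_refine_step(model, poses, speech_feats, guidance_scale=2.0, smooth_kernel=3):
    """One illustrative IRI-style refinement step (hypothetical interface).

    poses:        (T, D) current estimate of the motion sequence
    speech_feats: (T, C) speech conditioning features
    model:        callable (poses, cond) -> reconstructed poses of shape (T, D)
    """
    # Classifier-free guidance: blend the speech-conditioned reconstruction with
    # an unconditional one (condition zeroed out) to pull samples toward the
    # real-gesture distribution without any auxiliary supervision.
    cond_pred = model(poses, speech_feats)
    uncond_pred = model(poses, torch.zeros_like(speech_feats))
    guided = uncond_pred + guidance_scale * (cond_pred - uncond_pred)

    # Temporal smoothing: a simple moving average over frames removes abrupt
    # inter-frame transitions while keeping the overall trajectory intact.
    kernel = torch.ones(1, 1, smooth_kernel) / smooth_kernel
    x = guided.t().unsqueeze(1)                       # (D, 1, T)
    smoothed = F.conv1d(x, kernel, padding=smooth_kernel // 2)
    return smoothed.squeeze(1).t()                    # (T, D)
```

In the full iterative procedure, such a step would be applied cyclically, feeding each refined sequence back in as the next iteration's input until the motion stabilizes.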