Co-speech gesture generation aims to synthesize realistic body movements that are semantically coherent with speech and faithful to a user-specified gestural style. Existing VQ-VAE-based co-speech gesture generation methods improve generation quality but neither encode semantic structure into the motion representation nor explicitly disentangle content from style, which limits both semantic coherence and personalization fidelity. We present PersonaGest, a two-stage framework that addresses both limitations. In the first stage, a semantic-guided RVQ-VAE disentangles motion content and gestural style within the residual quantization structure: a Semantic-Aware Motion Codebook (SMoC) organizes the content codebook by gesture semantics, and contrastive learning further enforces content-style separation. In the second stage, a Masked Generative Transformer generates content tokens via a semantic-aware re-masking strategy, followed by a cascade of Style Residual Transformers conditioned on a reference motion prompt for style control. Extensive experiments demonstrate state-of-the-art performance on objective metrics and in perceptual user studies, with strong style consistency with the reference prompt. Our project page with demo videos is available at https://danny-nus.github.io/PersonaGest/
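For readers unfamiliar with the residual quantization structure mentioned above, the following is a minimal sketch of generic residual vector quantization, not the PersonaGest implementation. Each codebook layer quantizes whatever residual the previous layer left behind, so a latent vector becomes one index per layer. The function names, the toy codebooks, and the two-layer depth are all hypothetical; how the paper assigns content and style to particular layers is not specified in the abstract and is not modeled here.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`
    under squared Euclidean distance."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(vector, codebooks):
    """Generic residual vector quantization (illustrative only).

    Each layer picks the nearest code to the current residual, adds that
    code to the running reconstruction, and passes the remaining residual
    on to the next layer.
    """
    residual = list(vector)
    reconstruction = [0.0] * len(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        code = codebook[idx]
        indices.append(idx)
        reconstruction = [r + c for r, c in zip(reconstruction, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, reconstruction


# Toy, hand-written codebooks (hypothetical values): a coarse first layer
# and a finer second layer that refines the leftover residual.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.0]],
]
indices, reconstruction = residual_quantize([1.4, 1.0], toy_codebooks)
print(indices, reconstruction)  # prints [1, 1] [1.5, 1.0]
```

In this toy case the first layer captures the coarse shape of the vector and the second layer adds a smaller correction, which is the property that makes a layered token stack a natural place to separate a base representation from finer residual detail.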