Generating realistic and contextually relevant co-speech gestures is a challenging yet increasingly important task in the creation of multimodal artificial agents. Prior methods focused on learning a direct correspondence between co-speech gesture representations and produced motions, which yielded gestures that appeared natural but were often judged unconvincing in human assessment. We present an approach that pre-trains on partial gesture sequences using a generative adversarial network with a quantization pipeline. The resulting codebook vectors serve as both input and output in our framework, forming the basis for the generation and reconstruction of gestures. By learning a mapping to a latent space representation rather than directly to a vector representation, the framework generates highly realistic and expressive gestures that closely replicate human movement and behavior while avoiding artifacts in the generation process. We evaluate our approach by comparing it with established methods for generating co-speech gestures as well as with existing datasets of human behavior. We also perform an ablation study to assess our findings. The results show that our approach outperforms the current state of the art by a clear margin and is, in part, indistinguishable from human gesturing. We make our data pipeline and the generation framework publicly available.
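To illustrate how codebook vectors can serve as both input and output, the following is a minimal sketch of a generic nearest-neighbor codebook lookup. It is not the authors' pipeline: the abstract does not specify how quantization is performed, so the function names (`quantize`, `reconstruct`), the squared Euclidean distance, and the toy two-dimensional codebook are all assumptions made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry.

    Assumption: nearest-neighbor assignment under squared Euclidean
    distance, as in generic vector quantization.
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def reconstruct(indices: List[int], codebook: List[Vector]) -> List[Vector]:
    """Look up the codebook vector for each index."""
    return [codebook[k] for k in indices]


# Hypothetical toy data: three codebook entries, three latent frames.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
latents = [(0.9, 0.1), (0.1, 0.8), (0.2, -0.1)]

indices = quantize(latents, codebook)
quantized = reconstruct(indices, codebook)
print(indices)  # [1, 2, 0]
```

In a setup like this, a generator would operate on the discrete indices (or their codebook vectors) rather than on raw pose values, and a decoder would turn the looked-up vectors back into motion.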