This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike comparisons across separate research papers, differences in results here are due only to differences between methods, which enables direct comparison between systems. The dataset was based on 18 hours of full-body motion capture, including fingers, of different people engaged in dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier, we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness; separating the two has been a difficult problem in the field. The evaluation results show some synthetic gesture conditions being rated as significantly more human-like than 3D human motion capture. To the best of our knowledge, this has not been demonstrated before. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. We also find that conventional objective metrics do not correlate well with subjective human-likeness ratings in this large evaluation. The one exception is the Fr\'echet gesture distance (FGD), which achieves a Kendall's tau rank correlation of around $-0.5$ with those ratings. Based on the challenge results, we formulate numerous recommendations for system building and evaluation.
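As a reading aid for the reported correlation, the following is a minimal sketch of how a Kendall's tau rank correlation between an objective metric and subjective ratings can be computed. It assumes the simple tau-a variant, in which tied pairs count as neither concordant nor discordant; the abstract does not say which variant was used. The function name `kendall_tau` and all numbers are hypothetical illustrations, not challenge data. Because FGD is a distance, where lower is better, a negative tau means the metric agrees with the human-likeness ratings.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant
    (an assumption; the abstract does not specify the variant).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        raise ValueError("need at least two observations")

    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # both variables order this pair the same way
        elif sign < 0:
            discordant += 1  # the two variables order this pair oppositely

    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values for five systems; NOT results from the challenge.
fgd = [10.0, 20.0, 30.0, 40.0, 50.0]            # objective metric, lower is better
human_likeness = [70.0, 60.0, 65.0, 40.0, 50.0]  # subjective rating, higher is better

tau = kendall_tau(fgd, human_likeness)
print(tau)  # -0.6
```

In this toy example, 8 of the 10 system pairs are ordered oppositely by the two measures and 2 are ordered the same way, giving a tau of (2 - 8) / 10 = -0.6.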