Evaluating gesture-generation in a large-scale open challenge: The GENEA Challenge 2022

From arXiv. The first three authors made equal contributions and share joint first authorship. arXiv admin note: substantial text overlap with arXiv:2208.10441.

This paper reports on the second GENEA Challenge, which benchmarks data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike comparisons across different research papers, differences in results here are due only to differences between methods, enabling direct comparison between systems. The dataset was based on 18 hours of full-body motion capture, including fingers, of different people engaged in dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier, we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which has been a difficult problem in the field. The evaluation results are a revolution and a revelation. Some synthetic conditions are rated as significantly more human-like than human motion capture. To the best of our knowledge, this has never been shown before on a high-fidelity avatar. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. We also find that conventional objective metrics do not correlate well with subjective human-likeness ratings in this large evaluation. The one exception is the Fréchet gesture distance (FGD), which achieves a Kendall's tau rank correlation of around -0.5 with those ratings. Based on the challenge results, we formulate numerous recommendations for system building and evaluation.
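
To make the correlation figure above concrete, the following is a minimal sketch of how a Kendall's tau rank correlation between an objective metric and subjective ratings could be computed. It is not the challenge's analysis code: the function name `kendall_tau_b`, the choice of the tau-b variant, and the numbers in the example are all assumptions made for illustration. The example values are invented placeholders, not challenge results.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length sequences.

    Hypothetical helper for illustration; tau-b is one common variant
    that corrects for ties.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    # Compare every pair of items and classify how the two rankings relate.
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes nothing
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # both rankings order the pair the same way
        else:
            discordant += 1  # the rankings disagree on this pair
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denom == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / denom


# Placeholder values (NOT challenge data): one entry per synthetic condition.
fgd_scores = [10.0, 20.0, 30.0, 40.0]   # lower is better
median_ratings = [60, 65, 50, 40]       # higher is better
tau = kendall_tau_b(fgd_scores, median_ratings)
print(round(tau, 3))
```

Because a lower FGD is better while a higher human-likeness rating is better, a metric that tracks the ratings yields a negative tau, which is why a value of around -0.5 indicates moderate agreement rather than disagreement.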
