Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark

Rajmund Nagy, Hendric Voss, Thanh Hoang-Minh, Mihail Tsakov, Teodor Nikolov, Zeyi Zhang, Tenglong Ao, Sicheng Yang, Shaoli Huang, Yongkang Cheng, M. Hamza Mughal, Rishabh Dabral, Kiran Chhatre, Christian Theobalt, Libin Liu, Stefan Kopp, Rachel McDonnell, Michael Neff, Taras Kucherenko, Youngwoo Yoon, Gustav Eje Henter

Source: arXiv. Accepted to CVPR 2026, Findings Track. 23 pages, 10 figures. The last two authors contributed equally.

We review human evaluation practices in automatic, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. As a result, it is impossible to know how different methods compare, or what the state of the art is. To address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely used BEAT2 motion-capture dataset. Using this protocol, we conduct a large-scale crowdsourced evaluation to rank six recent gesture-generation models (each trained by its original authors) across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results show that 1) motion realism has become a saturated evaluation measure on the BEAT2 dataset, with older models performing on par with more recent approaches; 2) previous findings of high speech-gesture alignment do not hold up under rigorous evaluation, even for specialised models; and 3) to make progress, the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking. To drive standardisation and enable new evaluation research, we release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies, which enable new evaluations without requiring model reimplementation; our open-source rendering script; and the 16,000 pairwise human preference votes collected for our benchmark.
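
As a rough illustration of how pairwise preference votes can be turned into a model ranking, the sketch below computes a simple per-model win rate. The abstract does not say which aggregation or statistical method the benchmark uses, so this is an assumption for illustration only; the function name `rank_by_win_rate`, the `(winner, loser)` vote format, and the model names are hypothetical and do not reflect the format of the released data.

```python
from collections import Counter


def rank_by_win_rate(votes):
    """Rank models by the fraction of pairwise comparisons they won.

    votes: iterable of (winner, loser) pairs (hypothetical format).
    Returns a list of (model, win_rate) tuples, best first.
    """
    wins = Counter()
    appearances = Counter()
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    rates = {model: wins[model] / appearances[model] for model in appearances}
    # Sort by descending win rate; break ties alphabetically for determinism.
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


# Toy votes with made-up model names, not real benchmark data.
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_a"),
]
ranking = rank_by_win_rate(votes)
for model, rate in ranking:
    print(f"{model}: {rate:.2f}")
```

A raw win rate ignores which opponents each model faced, so it is only a reasonable summary when every pair of models is compared about equally often. Under this scheme, the two evaluation dimensions (motion realism and speech-gesture alignment) would each be ranked from their own set of votes.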
