Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark

Rajmund Nagy, Hendric Voss, Thanh Hoang-Minh, Mihail Tsakov, Teodor Nikolov, Zeyi Zhang, Tenglong Ao, Sicheng Yang, Shaoli Huang, Yongkang Cheng, M. Hamza Mughal, Rishabh Dabral, Kiran Chhatre, Christian Theobalt, Libin Liu, Stefan Kopp, Rachel McDonnell, Michael Neff, Taras Kucherenko, Youngwoo Yoon, Gustav Eje Henter

From arXiv. Accepted to CVPR 2026, Findings Track. 23 pages, 10 figures. The last two authors contributed equally.

We review human evaluation practices in automatic, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. As a result, it is impossible to know how different methods compare or what the state of the art is. To address common shortcomings of evaluation design and to standardise future user studies in gesture-generation work, we introduce a detailed human evaluation protocol for the widely used BEAT2 motion-capture dataset. Using this protocol, we conduct a large-scale crowdsourced evaluation to rank six recent gesture-generation models, each trained by its original authors, across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results show that 1) motion realism has become a saturated evaluation measure on the BEAT2 dataset, with older models performing on par with more recent approaches; 2) previous findings of high speech-gesture alignment do not hold up under rigorous evaluation, even for specialised models; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. To drive standardisation and enable new evaluation research, we release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies, which enable new evaluations without requiring model reimplementation; our open-source rendering script; and 16,000 pairwise human preference votes collected for our benchmark.
