The Audio-to-3D-Gesture (A2G) task has enormous potential for applications in virtual reality, computer graphics, and related fields. However, current evaluation metrics, such as Fréchet Gesture Distance or Beat Constancy, fail to reflect human preferences for generated 3D gestures. To address this problem, it is becoming increasingly important to study human preference and to develop an objective quality assessment metric for AI-generated 3D human gestures. In this paper, we introduce the Ges-QA dataset, which includes 1,400 samples with multidimensional scores for gesture quality and audio-gesture consistency. Moreover, we collect binary classification labels indicating whether the generated gestures match the emotion of the audio. Building on the Ges-QA dataset, we propose Ges-QAer, a multi-modal transformer-based neural network with three branches for the video, audio, and 3D skeleton modalities, which scores A2G content along multiple dimensions. Comparative experiments and ablation studies demonstrate that Ges-QAer achieves state-of-the-art performance on our dataset.