Ges-QA: A Multidimensional Quality Assessment Dataset for Audio-to-3D Gesture Generation

The Audio-to-3D-Gesture (A2G) task has enormous potential for applications in virtual reality and computer graphics. However, current evaluation metrics, such as Fréchet Gesture Distance or Beat Constancy, fail to reflect human preferences for generated 3D gestures. To address this problem, studying human preference and developing an objective quality assessment metric for AI-generated 3D human gestures is becoming increasingly important. In this paper, we introduce the Ges-QA dataset, which includes 1,400 samples with multidimensional scores for gesture quality and audio-gesture consistency. Moreover, we collect binary classification labels indicating whether the generated gestures match the emotions of the audio. Using the Ges-QA dataset, we propose Ges-QAer, a multi-modal transformer-based neural network with three branches for the video, audio, and 3D skeleton modalities, which can score A2G content along multiple dimensions. Comparative experiments and ablation studies demonstrate that Ges-QAer achieves state-of-the-art performance on our dataset.
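
As an illustration of the annotation structure described in the abstract, the sketch below models one sample as a record with two subjective scores and one binary emotion label. The class name, field names, and the `summarize` helper are hypothetical, and the abstract does not state the score scale or the released file format, so the values used here are toy numbers only.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class GesQASample:
    """One annotated A2G sample (hypothetical schema, not the released format)."""
    sample_id: str
    gesture_quality: float            # subjective score for gesture quality
    audio_gesture_consistency: float  # subjective score for audio-gesture consistency
    emotion_match: bool               # binary label: gestures match the audio's emotion


def summarize(samples):
    """Return mean scores and the fraction of emotion-matched samples."""
    if not samples:
        raise ValueError("samples must be non-empty")
    return {
        "mean_gesture_quality": mean(s.gesture_quality for s in samples),
        "mean_consistency": mean(s.audio_gesture_consistency for s in samples),
        "emotion_match_rate": sum(s.emotion_match for s in samples) / len(samples),
    }


# Illustrative values only; these are not taken from the dataset.
toy = [
    GesQASample("a", 4.0, 3.0, True),
    GesQASample("b", 2.0, 3.0, False),
]
stats = summarize(toy)
```

Keeping the two scores and the emotion label as separate fields mirrors the abstract's distinction between multidimensional scoring and binary classification; a real loader would need to follow whatever schema the dataset actually ships with.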

