The capacity to create realistic virtual humans has progressed significantly, and such characters now appear in many applications across entertainment, education, and health. Speech-driven 3D gesture generation, an essential element of interactive virtual humans, still depends heavily on perceptual evaluation, yet studies often vary avatar appearance and facial presentation when judging the generated motions. Prior work suggests these visual choices can bias motion judgments, but controlled evidence remains limited. We address this gap with controlled evaluations of co-speech gestures across motion sources, displayed on seven representative avatar renderings used in contemporary research and application pipelines. Our results show that avatar and face presentation systematically shift perceptual judgments, and we provide recommendations for benchmarking gesture synthesis and for deploying virtual humans in human-facing applications.