Co-speech gestures are fundamental for communication. The advent of recent deep learning techniques has facilitated the creation of lifelike, synchronous co-speech gestures for Embodied Conversational Agents. "In-the-wild" datasets, which aggregate video content from platforms such as YouTube via human pose detection, offer a feasible solution by providing 2D skeletal sequences aligned with speech. Concurrent developments in lifting models enable the conversion of these 2D sequences into 3D gesture databases. However, the 3D poses estimated from the extracted 2D poses are, in essence, approximations of the ground truth, which remains in the 2D domain. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions, a topic that, to our knowledge, remains largely unexplored. Our study examines the effect of using either 2D or 3D joint coordinates as training data on the performance of speech-to-gesture deep generative models. We employ a lifting model to convert generated 2D pose sequences into 3D, and assess how gestures generated directly in 3D compare with those first generated in 2D and then lifted to 3D. We perform an objective evaluation using metrics widely adopted in the gesture generation field, as well as a user study to qualitatively evaluate the different approaches.