Although significant progress has been made in speech-driven 3D facial animation recently, the speech-driven animation of an indispensable facial component, eye gaze, has been overlooked by recent research. This is primarily due to the weak correlation between speech and eye gaze, as well as the scarcity of audio-gaze data, which makes it very challenging to generate 3D eye gaze motion from speech alone. In this paper, we propose a novel data-driven method that can generate diverse 3D eye gaze motions in harmony with speech. To achieve this, we first construct an audio-gaze dataset containing about 14 hours of audio-mesh sequences featuring high-quality eye gaze motion, head motion, and facial motion simultaneously. The motion data are acquired by performing lightweight eye gaze fitting and face reconstruction on videos from existing audio-visual datasets. We then tailor a novel speech-to-motion translation framework in which head motions and eye gaze motions are jointly generated from speech but are modeled in two separate latent spaces. This design stems from the physiological observation that the rotation range of the eyeballs is smaller than that of the head. By mapping the speech embedding into these two latent spaces, the difficulty of modeling the weak correlation between speech and non-verbal motion is attenuated. Finally, our TalkingEyes, integrated with a speech-driven 3D facial motion generator, can synthesize eye gaze motion, eye blinks, head motion, and facial motion collectively from speech. Extensive quantitative and qualitative evaluations demonstrate the superiority of the proposed method in generating diverse and natural 3D eye gaze motions from speech. The project page of this paper is: https://lkjkjoiuiu.github.io/TalkingEyes_Home/
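To make the two-latent-space design concrete, below is a minimal sketch (not the authors' released code) of the idea described above: a shared speech embedding is mapped into two separate latent spaces, one for head motion and one for eye gaze, each with its own decoder. All module names, dimensions, the use of plain linear/MLP mappings, and the 3-angle rotation outputs are illustrative assumptions; the actual framework may differ.

```python
# Hypothetical sketch of the dual-latent-space speech-to-motion idea.
# Assumptions: speech_dim matches a pretrained audio encoder's output;
# the gaze latent space is kept smaller than the head latent space,
# loosely reflecting the smaller rotation range of the eyeballs.
import torch
import torch.nn as nn

class DualLatentSpeechToMotion(nn.Module):
    def __init__(self, speech_dim=768, head_latent=128, gaze_latent=32):
        super().__init__()
        # Map the shared speech embedding into two separate latent spaces.
        self.to_head_latent = nn.Linear(speech_dim, head_latent)
        self.to_gaze_latent = nn.Linear(speech_dim, gaze_latent)
        # Decode each latent trajectory to per-frame rotations
        # (3 Euler angles per frame; an assumed parameterization).
        self.head_decoder = nn.Sequential(
            nn.Linear(head_latent, 64), nn.ReLU(), nn.Linear(64, 3))
        self.gaze_decoder = nn.Sequential(
            nn.Linear(gaze_latent, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, speech_emb):
        # speech_emb: (batch, frames, speech_dim)
        head = self.head_decoder(self.to_head_latent(speech_emb))  # (B, T, 3)
        gaze = self.gaze_decoder(self.to_gaze_latent(speech_emb))  # (B, T, 3)
        return head, gaze

# Usage with a dummy speech embedding for a 100-frame clip:
model = DualLatentSpeechToMotion()
head_rot, gaze_rot = model(torch.randn(1, 100, 768))
```

The point of the separation is that each motion type gets a latent space scaled to its own dynamics, so the weakly speech-correlated gaze signal is not forced to share capacity with the larger-amplitude head motion.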