Human emotional expression is inherently dynamic, complex, and fluid, characterized by smooth transitions in intensity throughout verbal communication. However, previous audio-driven talking-head generation methods have largely overlooked the modeling of such intensity fluctuations, which often results in static emotional outputs. In this paper, we explore how emotion intensity fluctuates during speech and propose a method for capturing and generating these subtle shifts in talking-head generation. Specifically, we develop a talking-head framework capable of generating a variety of emotions with precise control over intensity levels. This is achieved by learning a continuous emotion latent space, where emotion type is encoded in the latent direction and emotion intensity is reflected in the latent norm. In addition, to capture dynamic intensity fluctuations, we introduce an audio-to-intensity predictor that infers intensity from the speaking tone. The training signals for this predictor are obtained through our emotion-agnostic intensity pseudo-labeling method, without requiring frame-wise intensity annotations. Extensive experiments and analyses validate the effectiveness of our proposed method in accurately capturing and reproducing emotion intensity fluctuations in talking-head generation, thereby significantly enhancing the expressiveness and realism of the generated outputs.
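The direction/norm decomposition of the emotion latent space can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation; the vector dimensions and function names are assumptions made for clarity.

```python
import numpy as np

def decompose(z: np.ndarray):
    """Split an emotion latent into a unit direction (emotion type)
    and a scalar norm (emotion intensity)."""
    intensity = float(np.linalg.norm(z))
    direction = z / intensity if intensity > 0 else z
    return direction, intensity

def set_intensity(z: np.ndarray, intensity: float) -> np.ndarray:
    """Return a latent with the same emotion type but a new intensity,
    by rescaling the vector along its own direction."""
    direction, _ = decompose(z)
    return direction * intensity

# Hypothetical example: a latent code at intensity 2.0, rescaled to 0.5.
# The direction (emotion type) is unchanged; only the norm (intensity) moves.
z_strong = np.array([0.6, 0.8]) * 2.0
direction, intensity = decompose(z_strong)   # intensity == 2.0
z_mild = set_intensity(z_strong, 0.5)        # norm == 0.5, same direction
```

Under this view, a frame-wise intensity predictor only needs to output a scalar per audio frame; the talking-head generator then scales a fixed emotion direction by that scalar to produce smoothly varying expressions.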