Listener head generation focuses on generating the non-verbal behaviors (e.g., a smile) of a listener in response to the information delivered by a speaker. A significant challenge in generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation, which vary with the emotions and attitudes of both the speaker and the listener. To tackle this problem, we propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords and explicitly models the probability distribution of the motions under different emotions in conversation. Benefiting from the ``explicit'' and ``discrete'' design, our ELP model can not only automatically generate natural and diverse responses to a given speaker by sampling from the learned distribution, but also generate controllable responses with a predetermined attitude. Under several quantitative metrics, our ELP exhibits significant improvements over previous methods.
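The idea of composing a motion from discrete codewords and sampling it from an emotion-conditioned distribution can be illustrated with a minimal sketch. This is not the ELP implementation: the codebook layout (one codebook per "slot"), composition by summation, the table of per-emotion logits, and all names and sizes below are assumptions made purely for illustration. In the actual model, the distribution would be predicted from the speaker's signal rather than read from a fixed table.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
NUM_EMOTIONS = 3    # e.g. positive / neutral / negative attitude
NUM_SLOTS = 4       # number of codewords composed into one motion
NUM_CODEWORDS = 8   # codebook size per slot
MOTION_DIM = 6      # dimensionality of one facial-motion vector


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def sample_motion(codebooks, emotion_logits, emotion, rng):
    """Sample one motion vector for a given emotion.

    codebooks:      (S, K, D) array, K codewords of dimension D per slot.
    emotion_logits: (E, S, K) array of unnormalized scores; an explicit
                    categorical distribution per emotion and slot.
    emotion:        integer emotion index in [0, E).
    Returns the sampled codeword indices (S,) and the composed motion (D,).
    """
    probs = softmax(emotion_logits[emotion], axis=-1)  # (S, K)
    # Draw one codeword index per slot from its categorical distribution.
    indices = np.array([rng.choice(probs.shape[1], p=p) for p in probs])
    # Compose the motion by summing the selected codewords (an assumption).
    motion = codebooks[np.arange(len(indices)), indices].sum(axis=0)
    return indices, motion


rng = np.random.default_rng(0)
# Random stand-ins for quantities that would be learned in practice.
codebooks = rng.normal(size=(NUM_SLOTS, NUM_CODEWORDS, MOTION_DIM))
emotion_logits = rng.normal(size=(NUM_EMOTIONS, NUM_SLOTS, NUM_CODEWORDS))

# Sampling repeatedly gives diverse responses; fixing `emotion` controls attitude.
indices, motion = sample_motion(codebooks, emotion_logits, emotion=1, rng=rng)
```

Because the distribution is explicit and discrete, diversity comes from repeated sampling, while control comes from choosing which emotion's distribution to sample from.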