Listener head generation centers on generating non-verbal behaviors (e.g., a smile) of a listener in response to the information delivered by a speaker. A significant challenge in generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation, which vary with the emotions and attitudes of both the speaker and the listener. To tackle this problem, we propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords and explicitly models the probability distribution of the motions under different emotions in conversation. Benefiting from this ``explicit'' and ``discrete'' design, our ELP model can not only automatically generate natural and diverse responses to a given speaker by sampling from the learned distribution, but also generate controllable responses with a predetermined attitude. Under several quantitative metrics, ELP exhibits significant improvements over previous methods.
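The idea of composing a motion from discrete codewords and sampling it under a chosen emotion can be illustrated with a minimal sketch. This is not the ELP architecture. Here random values stand in for learned codebooks and logits, the emotion labels `"positive"` and `"negative"` are hypothetical, and concatenating one codeword per group is an assumed composition rule. All function and variable names are illustrative.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into a categorical distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_motion(codebooks, emotion_logits, emotion, rng):
    """Sample one codeword per group under the given emotion.

    codebooks: list of groups, each a list of codeword vectors.
    emotion_logits: dict mapping emotion -> per-group logits over codewords.
    Returns the chosen indices and the composed motion vector
    (assumed here to be the concatenation of the chosen codewords).
    """
    indices, motion = [], []
    for group, group_logits in zip(codebooks, emotion_logits[emotion]):
        probs = softmax(group_logits)
        k = rng.choices(range(len(group)), weights=probs, k=1)[0]
        indices.append(k)
        motion.extend(group[k])
    return indices, motion


# Placeholder parameters: random values stand in for learned ones.
rng = random.Random(0)
NUM_GROUPS, NUM_CODES, DIM = 4, 8, 3
codebooks = [
    [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_CODES)]
    for _ in range(NUM_GROUPS)
]
emotion_logits = {
    emotion: [[rng.gauss(0, 1) for _ in range(NUM_CODES)] for _ in range(NUM_GROUPS)]
    for emotion in ("positive", "negative")
}

indices, motion = sample_motion(codebooks, emotion_logits, "positive", rng)
```

Calling `sample_motion` repeatedly with the same emotion yields different index combinations, which mirrors the diversity obtained by sampling, while fixing the `emotion` argument mirrors generation with a predetermined attitude.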