Learning to Generate Context-Sensitive Backchannel Smiles for Embodied AI Agents with Applications in Mental Health Dialogues

The critical shortage of mental health resources for effective screening, diagnosis, and treatment remains a significant challenge. This scarcity underscores the need for innovative solutions, particularly ones that improve the accessibility and efficacy of therapeutic support. Embodied agents with advanced interactive capabilities are a promising and cost-effective supplement to traditional caregiving methods. Crucial to these agents' effectiveness is their ability to simulate non-verbal behaviors, such as backchannels, that are pivotal in establishing rapport and understanding in therapeutic contexts but remain under-explored. To improve the rapport-building capabilities of embodied agents, we annotated backchannel smiles in videos of intimate face-to-face conversations on topics such as mental health, illness, and relationships. We hypothesized that both speaker and listener behaviors affect the duration and intensity of backchannel smiles. We found that cues from speech prosody and language, along with the demographics of the speaker and listener, contain significant predictors of the intensity of backchannel smiles. Based on these findings, we frame backchannel smile production in embodied agents as a generation problem. Results from our attention-based generative model suggest that listener information offers performance improvements over the baseline speaker-centric generation approach. Conditioned generation using the significant predictors of smile intensity provides statistically significant improvements in empirical measures of generation quality. A user study in which generated smiles were transferred to an embodied agent suggests that an agent with backchannel smiles is perceived as more human-like, and is a more attractive alternative for non-personal conversations, than an agent without backchannel smiles.
