This paper presents a machine-learning-based, speaker-centric Emotion AI approach that predicts audience affective engagement and vocal attractiveness in asynchronous video-based learning using only speaker-side affective expressions. Motivated by the demand for scalable, privacy-preserving affective computing applications, the approach combines two distinct regression models that draw on a large corpus developed from Massive Open Online Courses (MOOCs) to support affectively engaging learning experiences. The first model predicts affective engagement by integrating emotional expressions from facial dynamics, oculomotor features, prosody, and cognitive semantics; the second predicts vocal attractiveness from speaker-side acoustic features alone. On speaker-independent test sets, both models achieved strong predictive performance (R² = 0.85 for affective engagement and R² = 0.88 for vocal attractiveness), indicating that speaker-side affect can serve as a functional proxy for aggregated audience feedback. These empirical results show that speaker-side multimodal features, including acoustics, can forecast audience feedback without requiring audience-side input.
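The speaker-independent evaluation and the R² metric reported above can be illustrated with a minimal sketch. It assumes synthetic data, a single hypothetical feature, and a one-variable least-squares fit; none of these are the paper's models, features, or corpus, and the function names (`r_squared`, `speaker_independent_split`, `fit_line`) are introduced here for illustration only.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def speaker_independent_split(speaker_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no speaker appears in both sets."""
    speakers = sorted(set(speaker_ids))
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train_idx = [i for i, s in enumerate(speaker_ids) if s not in test_speakers]
    test_idx = [i for i, s in enumerate(speaker_ids) if s in test_speakers]
    return train_idx, test_idx


def fit_line(xs, ys):
    """Ordinary least squares for one feature: y = a * x + b."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    return a, mean_y - a * mean_x


# Synthetic stand-in data (assumption): 10 speakers, 5 clips each,
# one speaker-side feature and one aggregated audience rating per clip.
rng = random.Random(42)
speaker_ids, features, ratings = [], [], []
for speaker in range(10):
    for _ in range(5):
        x = rng.uniform(0.0, 1.0)
        speaker_ids.append(f"spk{speaker}")
        features.append(x)
        ratings.append(2.0 * x + 1.0 + rng.gauss(0.0, 0.1))

train_idx, test_idx = speaker_independent_split(speaker_ids)

# Fit on training speakers only, then score on unseen speakers.
a, b = fit_line([features[i] for i in train_idx], [ratings[i] for i in train_idx])
predictions = [a * features[i] + b for i in test_idx]
score = r_squared([ratings[i] for i in test_idx], predictions)
print(f"R^2 on held-out speakers: {score:.3f}")
```

Splitting by speaker rather than by clip keeps every clip from a test speaker out of training, so the score reflects generalization to unseen speakers rather than memorized speaker-specific traits.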