Videoconferencing is now a frequent mode of communication in both professional and informal settings, yet it often lacks the fluidity and enjoyment of in-person conversation. This study leverages multimodal machine learning to predict moments of negative experience in videoconferencing. We sampled thousands of short clips from the RoomReader corpus, extracting audio embeddings, facial actions, and body motion features to train models that identify low conversational fluidity and low enjoyment and that classify conversational events (backchanneling, interruption, or gap). Our best models achieved an ROC-AUC of up to 0.87 on held-out videoconference sessions, with domain-general audio features proving most important. These results demonstrate that multimodal audio-video signals can effectively predict high-level subjective conversational outcomes. In addition, this work contributes to research on videoconferencing user experience by showing that multimodal machine learning can identify rare moments of negative user experience for further study or mitigation.
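To make the evaluation setup concrete, the sketch below illustrates one way to score a binary "low fluidity" classifier on held-out sessions, where clips from the same videoconference session never appear in both the training and test sets. This is a minimal illustration, not the paper's code: the synthetic features, label rate, session count, and choice of classifier are all assumptions standing in for the extracted multimodal features and models described above.

```python
# Minimal sketch (assumed, not the authors' implementation): session-level
# hold-out evaluation for a per-clip binary classifier, given precomputed
# multimodal feature vectors. All data here is synthetic and illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

n_clips, n_features = 2000, 64               # stand-in for audio/face/body features
X = rng.normal(size=(n_clips, n_features))   # one feature vector per short clip
y = rng.binomial(1, 0.15, size=n_clips)      # rare "negative experience" labels
sessions = rng.integers(0, 30, size=n_clips) # session each clip was sampled from

# Split by session so no session contributes clips to both train and test,
# mirroring evaluation on held-out videoconference sessions.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=sessions))

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X[train_idx], y[train_idx])

# ROC-AUC on the held-out sessions (the abstract reports up to 0.87).
scores = clf.predict_proba(X[test_idx])[:, 1]
print(f"ROC-AUC on held-out sessions: {roc_auc_score(y[test_idx], scores):.2f}")
```

Grouping the split by session rather than by clip matters here: clips from the same conversation are highly correlated, so a clip-level split would inflate the reported ROC-AUC.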