We investigate how verbal and nonverbal linguistic features, exhibited by speakers and listeners in dialogue, can contribute to predicting the listener's state of understanding in explanatory interactions on a moment-by-moment basis. Specifically, we examine three linguistic cues related to cognitive load and hypothesised to correlate with listener understanding: the information value (operationalised with surprisal) and syntactic complexity of the speaker's utterances, and the variation in the listener's interactive gaze behaviour. Based on statistical analyses of the MUNDEX corpus of face-to-face dialogic board game explanations, we find that individual cues vary with the listener's level of understanding. Listener states ('Understanding', 'Partial Understanding', 'Non-Understanding' and 'Misunderstanding') were self-annotated by the listeners using a retrospective video-recall method. The results of a subsequent classification experiment, involving two off-the-shelf classifiers and a fine-tuned German BERT-based multimodal classifier, demonstrate that prediction of these four states of understanding is generally possible and improves when the three linguistic cues are considered alongside textual features.
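To make the surprisal operationalisation concrete: the surprisal of a word is the negative log probability of that word given its context, so less predictable words carry more information. The sketch below is illustrative only and is not the estimator used in the study. It assumes a toy add-one-smoothed bigram model, and the function names `train_bigram`, `surprisal` and `mean_surprisal` are hypothetical.

```python
import math
from collections import Counter

BOS = "<s>"  # hypothetical sentence-start marker


def train_bigram(sentences):
    """Collect context and bigram counts from tokenised sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for tokens in sentences:
        padded = [BOS] + list(tokens)
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, len(vocab)


def surprisal(prev, word, model):
    """Surprisal in bits: -log2 P(word | prev), with add-one smoothing."""
    context_counts, bigram_counts, vocab_size = model
    # Toy simplification: the vocabulary size includes the start marker.
    p = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)
    return -math.log2(p)


def mean_surprisal(tokens, model):
    """Average per-token surprisal of one utterance."""
    padded = [BOS] + list(tokens)
    values = [surprisal(p, w, model) for p, w in zip(padded, padded[1:])]
    return sum(values) / len(values)
```

Under this toy model, a word pair seen in the training data receives lower surprisal than an unseen pair, which is the property that makes surprisal usable as a proxy for the information value of a speaker's utterance.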