Scripted dialogues such as movie and TV subtitles constitute a widespread source of training data for conversational NLP models. However, the linguistic characteristics of those dialogues are notably different from those observed in corpora of spontaneous interactions. This difference is particularly marked for communicative feedback and grounding phenomena such as backchannels, acknowledgments, or clarification requests. Such signals are known to constitute a key part of the conversation flow and are used by the dialogue participants to provide feedback to one another on their perception of the ongoing interaction. This paper presents a quantitative analysis of such communicative feedback phenomena in both subtitles and spontaneous conversations. Based on dialogue data in English, French, German, Hungarian, Italian, Japanese, Norwegian and Chinese, we extract both lexical statistics and classification outputs obtained with a neural dialogue act tagger. Two main findings of this empirical study are that (1) conversational feedback is markedly less frequent in subtitles than in spontaneous dialogues and (2) subtitles contain a higher proportion of negative feedback. Furthermore, we show that dialogue responses generated by large language models also follow the same underlying trends and include comparatively few occurrences of communicative feedback, except when those models are explicitly fine-tuned on spontaneous dialogues.
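The lexical side of such an analysis can be illustrated with a minimal sketch. It assumes a small hand-picked marker list and a simple "first token" heuristic; both are hypothetical choices made for illustration, not the lexicon or procedure used in the study. The example utterances are made up. The sketch computes the share of utterances that open with a feedback marker, so the rate can be compared across two corpora.

```python
import re
from typing import Iterable

# Hypothetical marker list for illustration only; not the study's lexicon.
FEEDBACK_MARKERS = {"yeah", "okay", "ok", "mhm", "uh-huh", "right", "huh", "what"}

# Lowercase alphabetic tokens, keeping hyphenated forms such as "uh-huh" whole.
TOKEN_RE = re.compile(r"[a-z]+(?:-[a-z]+)*")


def feedback_rate(utterances: Iterable[str]) -> float:
    """Return the share of non-empty utterances whose first token is a marker."""
    total = 0
    hits = 0
    for utterance in utterances:
        tokens = TOKEN_RE.findall(utterance.lower())
        if not tokens:
            continue  # skip utterances with no alphabetic content
        total += 1
        if tokens[0] in FEEDBACK_MARKERS:
            hits += 1
    return hits / total if total else 0.0


# Toy, made-up utterances; they are not drawn from any corpus.
spontaneous = ["Yeah, I know.", "Mhm.", "So we went there on Friday.", "Okay, right."]
subtitles = ["We need to leave now.", "What?", "I told you this would happen.", "Get in the car."]

print(f"spontaneous: {feedback_rate(spontaneous):.2f}")
print(f"subtitles:   {feedback_rate(subtitles):.2f}")
```

A first-token heuristic is deliberately crude: it misses turn-internal feedback and cannot separate positive from negative feedback, which is where a dialogue act tagger would be needed.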