A fundamental result in psycholinguistics is that less predictable words take longer to process. One theoretical explanation for this finding is Surprisal Theory (Hale, 2001; Levy, 2008), which quantifies a word's predictability as its surprisal, i.e., its negative log-probability given a context. While evidence supporting the predictions of Surprisal Theory has been replicated widely, most studies have focused on a very narrow slice of data: native English speakers reading English texts. Indeed, no comprehensive multilingual analysis exists. We address this gap in the current literature by investigating the relationship between surprisal and reading times in eleven different languages, distributed across five language families. Deriving estimates from language models trained on monolingual and multilingual corpora, we test three predictions associated with Surprisal Theory: (i) whether surprisal is predictive of reading times; (ii) whether expected surprisal, i.e., contextual entropy, is predictive of reading times; and (iii) whether the linking function between surprisal and reading times is linear. We find that all three predictions are borne out crosslinguistically. Because they cover a more diverse set of languages, we argue that these results offer the most robust link to date between information theory and incremental language processing across languages.
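The two quantities named above, surprisal (negative log-probability of a word given its context) and contextual entropy (expected surprisal over the possible next words), can be illustrated with a minimal sketch. The log base (bits), the function names, and the toy next-word distribution are assumptions made for illustration only; they are not taken from the paper, whose estimates come from trained language models.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal of a word: negative log-probability given its context.

    Base 2 (bits) is an assumption; the abstract does not fix a base.
    """
    return -math.log2(prob)


def contextual_entropy(dist: dict[str, float]) -> float:
    """Expected surprisal over a next-word distribution, in bits."""
    return sum(p * surprisal(p) for p in dist.values() if p > 0)


# Hypothetical next-word distribution for one context (toy numbers).
next_word = {"dog": 0.5, "cat": 0.25, "bird": 0.25}

word_surprisal = surprisal(next_word["cat"])   # 2.0 bits
context_entropy = contextual_entropy(next_word)  # 1.5 bits
```

In this toy case the less probable word "cat" carries more surprisal than "dog" (2 bits versus 1 bit), while the entropy summarizes how uncertain the context is before the word is seen.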