Real-time sentence comprehension imposes a significant load on working memory, as comprehenders must maintain contextual information to anticipate future input. While measures of such load have played an important role in psycholinguistic theories, they have largely been formalized using symbolic grammars, which assign discrete, uniform costs to syntactic predictions. This study proposes an information-theoretic measure of processing storage cost, defined as the amount of information that previous words carry about future context under uncertainty. Unlike previous discrete, grammar-based metrics, this measure is continuous, theory-neutral, and can be estimated from pre-trained neural language models. The validity of this approach is demonstrated through three analyses in English: our measure (i) recovers well-known processing asymmetries in center embeddings and relative clauses, (ii) correlates with a grammar-based storage cost in a syntactically annotated corpus, and (iii) predicts reading-time variance in two large-scale naturalistic datasets over and above baseline models with traditional information-based predictors.
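The core quantity can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's estimator: it uses a hand-built bigram model over a tiny corpus in place of a pre-trained neural language model, and approximates the information a previous word carries about upcoming material as the reduction (in bits) in next-word entropy when that word is retained versus discarded. The function names and the corpus are hypothetical.

```python
# Toy illustration of an information-theoretic storage cost:
# how many bits of information a previous word carries about the
# next word, measured as unconditional word entropy minus the
# entropy of the next word given the previous word.
# All names and the corpus are illustrative, not from the study.

import math
from collections import Counter, defaultdict

corpus = "the dog chased the cat and the cat ran".split()

# Unigram and bigram counts from the toy corpus.
unigrams = Counter(corpus)
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def entropy(counter):
    """Shannon entropy (bits) of a distribution given as counts."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counter.values())

def storage_cost(prev_word):
    """Bits the previous word carries about the next word:
    unconditional word entropy minus next-word entropy
    conditioned on prev_word (a crude bigram proxy)."""
    return entropy(unigrams) - entropy(bigrams[prev_word])

cost = storage_cost("the")  # positive: "the" narrows what can follow
```

A real estimate would replace the bigram counts with next-word (or multi-word) distributions from a pre-trained neural language model, but the logic is the same: the more a word constrains upcoming context, the more information about it must be held in working memory.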