Language models are widely used in computational psycholinguistics to test theories that relate the negative log probability (the surprisal) of a region of interest (a substring of characters) under a language model to the cognitive cost it imposes on readers, as operationalized, for example, by gaze duration on the region. However, the application of modern language models to psycholinguistic studies is complicated by the practice of using tokenization as an intermediate step in training a model. Doing so results in a language model over token strings rather than one over character strings. Vexingly, regions of interest are generally misaligned with these token strings. This paper argues that token-level language models should be (approximately) marginalized into character-level language models before they are used in psycholinguistic studies to compute the surprisal of a region of interest; the marginalized character-level language model can then be used to compute the surprisal of any character substring the experimenter may wish to use as a predictor, which we term a focal area. Because our proposal marginalizes a token-level model into a character-level one, it solves the misalignment issue independently of the tokenization scheme. Empirically, we discover various focal areas whose surprisal is a better psychometric predictor than the surprisal of the region of interest itself.
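To make the proposal concrete, the sketch below illustrates, under simplifying assumptions, how a token-level model can be marginalized into character-level prefix probabilities, from which the surprisal of a focal area follows as a difference of two prefix log probabilities. The toy unigram vocabulary and the names p_token, prefix_prob, and surprisal are ours, not the paper's implementation; the exhaustive recursion serves only to show the marginalization, summing the probability of every token string whose decoding covers the given character string.

```python
import math

# A toy token-level language model: a vocabulary of token strings with a
# unigram distribution. The vocabulary and probabilities are hypothetical;
# any autoregressive token-level model exposing p(token | history) could
# be plugged in instead.
VOCAB = ["a", "b", "ab", "ba", "<eos>"]
UNIGRAM = {"a": 0.3, "b": 0.3, "ab": 0.15, "ba": 0.15, "<eos>": 0.1}

def p_token(token, history):
    """Conditional token probability. The unigram toy ignores the
    history; a real model would condition on it."""
    return UNIGRAM[token]

def prefix_prob(sigma, history=()):
    """Probability that the decoded character string begins with `sigma`,
    obtained by marginalizing over all minimal token covers: token
    strings whose decoding first reaches (or overshoots) the end of
    `sigma`."""
    if sigma == "":
        return 1.0
    total = 0.0
    for tok in VOCAB:
        if tok == "<eos>":
            continue
        if tok.startswith(sigma):
            # The token covers all of `sigma` (possibly overshooting);
            # every continuation is consistent, so add the mass and stop.
            total += p_token(tok, history)
        elif sigma.startswith(tok):
            # The token consumes a proper prefix of `sigma`; recurse on
            # the remaining characters.
            total += p_token(tok, history) * prefix_prob(
                sigma[len(tok):], history + (tok,)
            )
    return total

def surprisal(focal, prefix=""):
    """Character-level surprisal of a focal area given preceding context:
    -log2 P(focal | prefix) = -log2 [P(prefix + focal) / P(prefix)]."""
    return -math.log2(prefix_prob(prefix + focal) / prefix_prob(prefix))

print(surprisal("b", prefix="a"))  # focal area "b" after "a": ~0.66 bits
```

Note that the focal area need not align with any token boundary: the recursion sums over every tokenization, including those where a single token straddles the boundary between context and focal area. The exhaustive enumeration is exponential in the length of the string and is for illustration only; an efficient implementation would share work across tokenizations rather than recomputing each cover from scratch.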