An important assumption behind the use of LLMs on psycholinguistic data has gone unverified: LLM predictions are computed over subword tokens, not over a decomposition of words into morphemes. Does that matter? We test this by comparing surprisal estimates obtained with orthographic, morphological, and BPE tokenization against reading time data. Our results replicate previous findings and provide evidence that, in the aggregate, predictions using BPE tokenization do not suffer relative to morphological and orthographic segmentation. However, a finer-grained analysis points to potential issues with relying on BPE-based tokenization, yields promising results for morphologically aware surprisal estimates, and suggests a new method for evaluating morphological prediction.
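To make the comparison concrete: reading times are recorded per word, while a model assigns probabilities per token, so token-level surprisals have to be aggregated to the word level whatever the segmentation. By the chain rule, the surprisal of a word is the sum of the surprisals of its tokens. The following is a minimal sketch of that aggregation, not the procedure used in this work; the function names, the example segmentation, and the probabilities are all hypothetical.

```python
import math
from typing import List, Sequence


def token_surprisals(token_probs: Sequence[float]) -> List[float]:
    """Return the surprisal in bits, -log2 p, of each conditional token probability."""
    return [-math.log2(p) for p in token_probs]


def word_surprisals(token_probs: Sequence[float],
                    tokens_per_word: Sequence[int]) -> List[float]:
    """Sum token surprisals within each word.

    tokens_per_word[i] is the number of tokens (BPE pieces, morphemes, or
    characters, depending on the segmentation) that make up word i.
    """
    surprisals = token_surprisals(token_probs)
    if sum(tokens_per_word) != len(surprisals):
        raise ValueError("token counts do not match the number of probabilities")
    result = []
    start = 0
    for count in tokens_per_word:
        result.append(sum(surprisals[start:start + count]))
        start += count
    return result


# Hypothetical example: "unlocked" split into three pieces ("un", "lock", "ed"),
# followed by "doors" as a single piece. The probabilities are made up.
probs = [0.5, 0.25, 0.5, 0.125]
counts = [3, 1]
print(word_surprisals(probs, counts))  # [4.0, 3.0]
```

Because the aggregation is the same for every segmentation, differences in fit to reading times can be attributed to where the segment boundaries fall and how the model distributes probability across them.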