Large Language Models (LLMs) have recently been shown to produce estimates of psycholinguistic norms, such as valence, arousal, or concreteness, for words and multiword expressions that correlate with human judgments. These estimates are obtained by prompting an LLM, in zero-shot fashion, with a question similar to those used in human studies. For other norms, however, such as lexical decision time or age of acquisition, LLMs require supervised fine-tuning to produce results that align with ground-truth values. In this paper, we extend this approach to the previously unstudied features of sentence memorability and reading times, which depend on the relationships among multiple words in a sentence-level context. Our results show that, with fine-tuning, models can provide estimates that correlate with human-derived norms and exceed the predictive power of interpretable baseline predictors, demonstrating that LLMs contain useful information about sentence-level features. At the same time, zero-shot and few-shot performance is very mixed, providing further evidence that care is needed when using LLM prompting as a proxy for human cognitive measures.