In psycholinguistics, carefully controlled materials are crucial for ensuring that research outcomes can be attributed solely to the intended manipulations rather than to extraneous factors. To achieve this, psycholinguists typically pretest their linguistic materials; a common pretest solicits plausibility judgments from human evaluators on specific sentences. In this work, we investigate whether Language Models (LMs) can be used to generate these plausibility judgments. We evaluate a wide range of LMs across multiple linguistic structures and measure how well their plausibility judgments correlate with human judgments. We find that GPT-4 plausibility judgments correlate highly with human judgments across the structures we examine, whereas other LMs correlate well with humans only on commonly used syntactic structures. We then test whether this correlation implies that LMs can be used instead of humans for pretesting. We find that LMs work well when coarse-grained plausibility judgments are needed, but when fine-grained judgments are necessary, even GPT-4 does not provide satisfactory discriminative power.
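As a rough illustration of what comparing LM and human plausibility judgments can look like, the sketch below computes a rank correlation between two lists of per-sentence ratings. The choice of Spearman's rank correlation is an assumption made for illustration (the abstract does not say which correlation measure is used), and the ratings, the 1-7 scale, and all function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: one rating per sentence on an assumed 1-7 scale.
human_ratings = [6.5, 2.0, 5.0, 1.5, 4.0]  # e.g. mean human rating per item
lm_ratings = [7, 1, 5, 2, 4]               # e.g. rating elicited from an LM

rho = spearman(human_ratings, lm_ratings)
print(f"Spearman rho = {rho:.2f}")
```

A high rank correlation of this kind shows that the two sets of ratings order the items similarly. It does not by itself show that small plausibility differences between closely matched items are resolved reliably, which is the distinction between coarse-grained and fine-grained judgments drawn in the abstract.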