Evaluating the Impact of Stimulus Quality in Investigations of LLM Language Performance

Recent studies that use Large Language Models (LLMs) to test the Argument from the Poverty of the Stimulus (APS) have yielded contrasting results across syntactic phenomena. This paper investigates the hypothesis that characteristics of the stimuli used in those studies, including lexical ambiguities and structural complexities, may confound model performance. We propose a methodology for re-evaluating LLM competence on syntactic prediction, focusing on GPT-2. It involves: 1) establishing a baseline on previously used stimuli, both filtered and unfiltered, and 2) generating a new, refined dataset with a state-of-the-art (SOTA) generative LLM (Gemini 2.5 Pro Preview), guided by linguistically informed templates designed to mitigate the identified confounds. Our preliminary findings indicate that GPT-2 performs notably better on the refined parasitic gap (PG) stimuli than on the baselines, suggesting that stimulus quality significantly influences outcomes in surprisal-based evaluations of LLM syntactic competence.
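
As a rough illustration of what a surprisal-based evaluation involves, the sketch below scores minimal pairs by comparing the surprisal of a critical region in a grammatical and an ungrammatical variant. It is a minimal sketch under stated assumptions, not the procedure used in this paper: the function names (`surprisal`, `prefers_grammatical`, `accuracy`) and the probabilities are hypothetical, and in practice the probabilities would come from a language model such as GPT-2.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal in bits of an event with probability `prob`: -log2(prob)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def prefers_grammatical(p_grammatical: float, p_ungrammatical: float) -> bool:
    """True if the critical region is less surprising in the grammatical variant.

    Both arguments are model probabilities of the same critical region,
    conditioned on the grammatical and ungrammatical context respectively.
    """
    return surprisal(p_grammatical) < surprisal(p_ungrammatical)


def accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of minimal pairs on which the grammatical variant is preferred."""
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(prefers_grammatical(p_g, p_u) for p_g, p_u in pairs)
    return correct / len(pairs)


# Hypothetical probabilities for two minimal pairs (illustrative values only).
example_pairs = [(0.20, 0.05), (0.01, 0.10)]
score = accuracy(example_pairs)
```

Under this scheme a confound in a stimulus, such as a lexically ambiguous word in the critical region, can shift the probabilities for reasons unrelated to the syntactic contrast, which is why stimulus quality matters for the resulting score.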
