Beyond Memorization: Assessing Semantic Generalization in Large Language Models Using Phrasal Constructions

The web scale of pretraining data has created an important evaluation challenge: disentangling linguistic competence on cases that are well represented in pretraining data from generalization to out-of-domain language, specifically the dynamic, real-world instances that are less common in pretraining data. To this end, we construct a diagnostic evaluation that systematically assesses natural language understanding in LLMs by leveraging Construction Grammar (CxG). CxG provides a psycholinguistically grounded framework for testing generalization, as it explicitly links syntactic forms to abstract, non-lexical meanings. Our novel inference evaluation dataset consists of English phrasal constructions, for which speakers are known to abstract over commonplace instantiations in order to understand and produce creative instantiations. The dataset addresses two central questions: first, whether models can 'understand' the semantics of sentences that are likely to appear less often in pretraining data but are intuitive and easy for people to understand; second, whether LLMs can deploy the appropriate constructional semantics given constructions that are syntactically identical but divergent in meaning. Our results demonstrate that state-of-the-art models, including GPT-o1, exhibit a performance drop of over 40% on our second task, revealing a failure to generalize over syntactically identical forms to arrive at distinct constructional meanings in the way humans do. We make our novel dataset and associated experimental data, including prompts and model responses, publicly available.


