The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks. Furthermore, model robustness did not consistently scale with model size, revealing strong task dependence. Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.
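The lexical perturbation pipeline mentioned above can be illustrated with a minimal sketch. This is a hypothetical simplification, assuming a hand-built synonym table; the paper's actual pipeline would draw substitutions from a lexical resource with part-of-speech filtering to keep variants truth-conditionally equivalent.

```python
# Hypothetical synonym table standing in for a real lexical resource
# (e.g., WordNet with POS filtering). Keys are lowercase lemmas.
SYNONYMS = {
    "rapid": "swift",
    "primary": "main",
    "examine": "inspect",
}

def lexical_perturb(sentence: str) -> str:
    """Replace each word that has a listed synonym; leave others unchanged.

    Trailing punctuation is preserved; case handling beyond lowercase
    lookup is omitted for brevity.
    """
    out = []
    for token in sentence.split():
        core = token.rstrip(".,;:?!")        # strip trailing punctuation
        suffix = token[len(core):]           # keep it for reattachment
        out.append(SYNONYMS.get(core.lower(), core) + suffix)
    return " ".join(out)

print(lexical_perturb("We examine the primary cause."))
# → We inspect the main cause.
```

A real implementation would additionally verify that each substitution preserves the sentence's truth conditions, e.g., by constraining candidates to the same word sense and syntactic category.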