While large language models (LLMs) have shown exceptional capabilities in understanding complex queries and performing sophisticated tasks, their generalization abilities are often deeply entangled with memorization, necessitating more precise evaluation. To address this challenge, we introduce Scylla, a dynamic evaluation framework that quantitatively measures the generalization abilities of LLMs. Scylla disentangles generalization from memorization by assessing model performance on both in-distribution (ID) and out-of-distribution (OOD) data through 20 tasks spanning 5 levels of complexity. Through extensive experiments, we uncover a non-monotonic relationship between task complexity and the performance gap between ID and OOD data, which we term the generalization valley. Specifically, this phenomenon reveals a critical threshold, referred to as critical complexity, at which reliance on non-generalizable behavior peaks, indicating the upper bound of LLMs' generalization capabilities. As model size increases, the critical complexity shifts toward higher levels of task complexity, suggesting that larger models can handle more complex reasoning tasks before over-relying on memorization. Leveraging Scylla and the concept of critical complexity, we benchmark 28 LLMs, including open-source models such as the LLaMA and Qwen families and closed-source models such as Claude and GPT, providing a more robust evaluation and establishing a clearer understanding of LLMs' generalization capabilities.