The impressive performance of recent language models across a wide range of tasks suggests that they possess some degree of abstract reasoning skill. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? To disentangle these effects, we propose an evaluation framework based on "counterfactual" task variants that deviate from the default assumptions underlying standard tasks. Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance degrades substantially and consistently compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to a degree, they often also rely on narrow, non-transferable procedures for task-solving. These results motivate a more careful interpretation of language model performance that teases apart these aspects of behavior.