The impressive performance of recent language models across a wide range of tasks suggests that they possess some degree of abstract reasoning skill. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? To disentangle these possibilities, we propose an evaluation framework based on "counterfactual" task variants that deviate from the default assumptions underlying standard tasks. Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to a degree, they often also rely on narrow, non-transferable procedures for task-solving. These results motivate a more careful interpretation of language model performance that teases apart these aspects of behavior.