Large Language Models (LLMs) have revolutionized the field of Natural Language Processing thanks to their ability to reuse knowledge acquired from massive text corpora across a wide variety of downstream tasks, with minimal (if any) tuning. At the same time, it has been repeatedly shown that LLMs lack systematic generalization, that is, the ability to extrapolate learned statistical regularities outside the training distribution. In this work, we offer a systematic benchmarking of GPT-4, one of the most advanced LLMs available, on three algorithmic tasks whose difficulty can be controlled with two parameters. We compare the performance of GPT-4 with that of its predecessor (GPT-3.5) and with the Neural Data Router, a variant of the Transformer-Encoder architecture recently introduced to solve similar tasks. We find that advanced prompting techniques allow GPT-4 to reach superior accuracy on all tasks, demonstrating that state-of-the-art LLMs constitute a very strong baseline even on challenging tasks that require systematic generalization.