Large Language Models (LLMs) have revolutionized the field of Natural Language Processing thanks to their ability to reuse knowledge acquired from massive text corpora across a wide variety of downstream tasks, with minimal (if any) tuning. At the same time, it has been repeatedly shown that LLMs lack systematic generalization, the ability to extrapolate learned statistical regularities beyond the training distribution. In this work, we offer a systematic benchmarking of GPT-4, one of the most advanced LLMs available, on three algorithmic tasks whose difficulty can be controlled with two parameters. We compare the performance of GPT-4 with that of its predecessor (GPT-3.5) and with the Neural Data Router, a variant of the Transformer-Encoder architecture recently introduced to solve similar tasks. We find that advanced prompting techniques allow GPT-4 to reach superior accuracy on all tasks, demonstrating that state-of-the-art LLMs constitute a very strong baseline even in challenging tasks that require systematic generalization.