Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies

Large Language Models (LLMs) have revolutionized the field of Natural Language Processing thanks to their ability to reuse knowledge acquired on massive text corpora on a wide variety of downstream tasks, with minimal (if any) tuning steps. At the same time, it has been repeatedly shown that LLMs lack systematic generalization, the ability to extrapolate learned statistical regularities outside the training distribution. In this work, we offer a systematic benchmarking of GPT-4, one of the most advanced LLMs available, on three algorithmic tasks whose difficulty can be controlled with two parameters. We compare the performance of GPT-4 with that of its predecessor (GPT-3.5) and with the Neural Data Router, a variant of the Transformer-Encoder architecture recently introduced to solve similar tasks. We find that advanced prompting techniques allow GPT-4 to reach superior accuracy on all tasks, demonstrating that state-of-the-art LLMs constitute a very strong baseline even in challenging tasks that require systematic generalization.


