How is ChatGPT's behavior changing over time?

GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3) opinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating code, 6) US Medical License tests, and 7) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was reasonably good at identifying prime vs. composite numbers (84% accuracy), but GPT-4 (June 2023) was poor on the same questions (51% accuracy). This is partly explained by a drop in GPT-4's amenability to chain-of-thought prompting. Interestingly, GPT-3.5 was much better at this task in June than in March. GPT-4 became less willing to answer sensitive questions and opinion survey questions in June than in March. GPT-4 performed better on multi-hop questions in June than in March, while GPT-3.5's performance on this task dropped. Both GPT-4 and GPT-3.5 made more formatting mistakes in code generation in June than in March. We provide evidence that GPT-4's ability to follow user instructions has decreased over time, which is one common factor behind many of the behavior drifts. Overall, our findings show that the behavior of the "same" LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLMs.
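The prime vs. composite comparison above can be illustrated with a minimal sketch of how accuracy drift between two snapshots of a model might be measured. This is not the authors' evaluation code. The helper names (`is_prime`, `parse_yes_no`, `accuracy`), the rule of taking the last yes/no in a response as the verdict, and the sample responses are all assumptions made for illustration. The responses are invented and are not data from the study.

```python
import re
from math import isqrt
from typing import List, Optional


def is_prime(n: int) -> bool:
    """Ground truth by trial division (fine for small illustrative inputs)."""
    if n < 2:
        return False
    for d in range(2, isqrt(n) + 1):
        if n % d == 0:
            return False
    return True


def parse_yes_no(response: str) -> Optional[bool]:
    """Return True for 'yes', False for 'no', None if neither word appears.

    Assumption: the last yes/no in the text is the final verdict, so that
    chain-of-thought text before the answer does not confuse the parser.
    """
    matches = re.findall(r"\b(yes|no)\b", response.lower())
    if not matches:
        return None
    return matches[-1] == "yes"


def accuracy(numbers: List[int], responses: List[str]) -> float:
    """Fraction of responses whose verdict matches the ground truth.

    Responses with no parseable verdict count as wrong.
    """
    correct = 0
    for n, r in zip(numbers, responses):
        verdict = parse_yes_no(r)
        if verdict is not None and verdict == is_prime(n):
            correct += 1
    return correct / len(numbers)


# Hypothetical responses to "Is N a prime number?" (invented, not study data).
numbers = [7, 9, 11, 15]
march_responses = [
    "Step 1: 7 has no divisors other than 1 and itself. Answer: [Yes]",
    "9 = 3 * 3, so the answer is [No]",
    "[Yes]",
    "15 = 3 * 5. [No]",
]
june_responses = ["[No]", "[No]", "[No]", "[No]"]

march_acc = accuracy(numbers, march_responses)
june_acc = accuracy(numbers, june_responses)
drift = june_acc - march_acc
print(f"March: {march_acc:.2f}  June: {june_acc:.2f}  drift: {drift:+.2f}")
```

Running the same fixed question set against each snapshot and tracking the difference is one simple form of the continuous monitoring the abstract calls for. A real harness would also need to pin prompts and sampling settings, which this sketch leaves out.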


