How is ChatGPT's behavior changing over time?

GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four diverse tasks: 1) solving math problems, 2) answering sensitive/dangerous questions, 3) generating code, and 4) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%), but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4%). Interestingly, GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) on this task. GPT-4 was less willing to answer sensitive questions in June than in March, and both GPT-4 and GPT-3.5 made more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the same LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLM quality.
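The kind of monitoring the abstract calls for can be illustrated with a minimal sketch: score two snapshots of a model on the same prime-identification questions and report the change in accuracy. This is not the authors' evaluation code. The function names (`parse_yes_no`, `accuracy`, `drift`), the yes/no answer format, and the toy replies are all assumptions made for illustration, and the replies are invented rather than real model outputs.

```python
from math import isqrt
from typing import List, Optional


def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    for d in range(2, isqrt(n) + 1):
        if n % d == 0:
            return False
    return True


def parse_yes_no(response: str) -> Optional[bool]:
    """Map a free-text reply to True/False, or None if it is neither.

    Assumes (hypothetically) that the model was asked to begin its
    answer with "Yes" or "No".
    """
    text = response.strip().lower()
    if text.startswith("yes"):
        return True
    if text.startswith("no"):
        return False
    return None


def accuracy(numbers: List[int], responses: List[str]) -> float:
    """Fraction of replies matching the ground truth.

    Unparseable replies (None) never equal True or False, so they
    count as wrong.
    """
    correct = sum(
        parse_yes_no(r) == is_prime(n) for n, r in zip(numbers, responses)
    )
    return correct / len(numbers)


def drift(numbers: List[int], old: List[str], new: List[str]) -> float:
    """Accuracy of the newer snapshot minus accuracy of the older one."""
    return accuracy(numbers, new) - accuracy(numbers, old)


# Toy data: invented replies, NOT real model outputs.
numbers = [7, 9, 11, 15]
march_replies = ["Yes", "No", "Yes", "No"]
june_replies = ["No", "No", "No", "No"]

change = drift(numbers, march_replies, june_replies)
```

Running the same fixed question set against each dated snapshot and tracking `change` over time is one simple way to notice the kind of shift the abstract reports; a real harness would also need a larger question set and more robust answer parsing.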

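Related content

GPT-4

In the early hours of March 15, 2023 (Beijing time), OpenAI, the developer of ChatGPT, released GPT-4, a new multimodal pretrained large model that is more reliable and more creative, can handle more nuanced instructions, and can generate content from both image and text prompts. Specifically, compared with the previous generation of models, GPT-4 represents a major leap: it accepts both image and text input and has strong image recognition ability; it greatly raises the text input limit, so that in ChatGPT mode GPT-4 can process more than 25,000 words of text and follow more detailed instructions; and its answer accuracy has also improved significantly.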
