GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3) opinion surveys, 4) multi-hop knowledge-intensive questions, 5) code generation, 6) US Medical License tests, and 7) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was reasonably good at identifying prime vs. composite numbers (84% accuracy), but GPT-4 (June 2023) performed poorly on the same questions (51% accuracy). This is partly explained by a drop in GPT-4's amenability to following chain-of-thought prompting. Interestingly, GPT-3.5 was much better on this task in June than in March. GPT-4 was less willing to answer sensitive questions and opinion survey questions in June than in March. GPT-4 performed better on multi-hop questions in June than in March, while GPT-3.5's performance on this task dropped. Both GPT-4 and GPT-3.5 made more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the "same" LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLMs.
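As a rough illustration of the kind of continuous monitoring called for above, the following minimal Python sketch scores several snapshots of a service on the prime vs. composite task and reports per-snapshot accuracy. It is not the authors' evaluation harness. The prompt wording, the yes/no answer parser, and the names `parse_yes_no`, `accuracy`, and `accuracy_by_snapshot` are all assumptions made for illustration, and each snapshot is represented by a hypothetical callable that maps a prompt string to a response string.

```python
import re
from typing import Callable, Dict, Iterable, Optional

# Hypothetical prompt template; the actual wording used in the study is not
# given in the abstract.
PROMPT = "Is {n} a prime number? Answer Yes or No."


def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def parse_yes_no(response: str) -> Optional[bool]:
    """Return True/False for the last standalone 'yes'/'no' in the response.

    Returns None when neither word appears. This is an assumed parsing rule.
    """
    matches = re.findall(r"\b(yes|no)\b", response.lower())
    if not matches:
        return None
    return matches[-1] == "yes"


def accuracy(model: Callable[[str], str], numbers: Iterable[int]) -> float:
    """Fraction of numbers classified correctly; unparseable answers count as wrong."""
    numbers = list(numbers)
    correct = 0
    for n in numbers:
        answer = parse_yes_no(model(PROMPT.format(n=n)))
        if answer is not None and answer == is_prime(n):
            correct += 1
    return correct / len(numbers)


def accuracy_by_snapshot(
    snapshots: Dict[str, Callable[[str], str]], numbers: Iterable[int]
) -> Dict[str, float]:
    """Evaluate every snapshot on the same question set so results are comparable."""
    numbers = list(numbers)
    return {name: accuracy(model, numbers) for name, model in snapshots.items()}
```

In practice each callable would wrap a request to a specific dated version of the service. Re-running the same fixed question set on a schedule and comparing the resulting dictionaries is one simple way to surface the kind of drift reported here.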