There is great interest in fine-tuning frontier large language models (LLMs) to inject new information and update existing knowledge. While commercial LLM fine-tuning APIs from providers such as OpenAI and Google promise flexible adaptation for various applications, the efficacy of fine-tuning remains unclear. In this study, we introduce FineTuneBench, an evaluation framework and dataset for understanding how well commercial fine-tuning APIs can successfully learn new and updated knowledge. We analyze five frontier LLMs with commercially available fine-tuning APIs, including GPT-4o and Gemini 1.5 Pro, on their effectiveness in two settings: (1) ingesting novel information, such as recent news events and profiles of new people, and (2) updating existing knowledge, such as updated medical guidelines and code frameworks. Our results reveal substantial shortcomings in all the models' abilities to effectively learn new information through fine-tuning, with an average generalization accuracy of 37% across all models. When updating existing knowledge, such as incorporating medical guideline updates, commercial fine-tuning APIs show even more limited capability (average generalization accuracy of 19%). Overall, fine-tuning GPT-4o mini is the most effective for infusing new knowledge and updating knowledge, followed by GPT-3.5 Turbo and GPT-4o. The fine-tuning APIs for Gemini 1.5 Flash and Gemini 1.5 Pro are unable to learn new knowledge or update existing knowledge. These findings underscore a major shortcoming in using current commercial fine-tuning services to achieve reliable knowledge infusion in common scenarios. We open source the FineTuneBench dataset at https://github.com/kevinwu23/StanfordFineTuneBench.