MUSCLE: A Model Update Strategy for Compatible LLM Evolution

Large Language Models (LLMs) are frequently updated, through changes to their data or architecture, to improve their performance. When updating models, developers often focus on increasing overall performance metrics and place less emphasis on compatibility with previous model versions. However, users often build a mental model of the functionality and capabilities of the particular machine learning model they interact with. They have to adapt that mental model with every update -- a draining task that can lead to user dissatisfaction. In practice, fine-tuned downstream task adapters rely on pretrained LLM base models. When these base models are updated, the user-facing downstream task models experience instance regression, or negative flips -- instances that were previously predicted correctly are now predicted incorrectly. This happens even when the downstream task training procedures remain identical. Our work aims to provide seamless model updates to a user in two ways. First, we provide evaluation metrics for a notion of compatibility with prior model versions, designed specifically for generative tasks but also applicable to discriminative tasks. We observe regression and inconsistencies between different model versions on a diverse set of tasks and model updates. Second, we propose a training strategy to minimize the number of inconsistencies in model updates, which involves training a compatibility model that can enhance task fine-tuned language models. We reduce negative flips -- instances where a prior model version was correct but the new model is incorrect -- by up to 40% from Llama 1 to Llama 2.
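
To make the notion of a negative flip concrete, the following is a minimal sketch of how a negative flip rate could be computed from per-instance correctness of an old and a new model. The function name `negative_flip_rate` and the choice to normalize by the total number of instances are assumptions made for illustration; the abstract does not specify the exact metric definition used in the paper.

```python
from typing import Sequence


def negative_flip_rate(old_correct: Sequence[bool], new_correct: Sequence[bool]) -> float:
    """Fraction of instances the old model got right and the new model gets wrong.

    Assumption: the rate is normalized by the total number of instances.
    """
    if len(old_correct) != len(new_correct):
        raise ValueError("both models must be evaluated on the same instances")
    if len(old_correct) == 0:
        return 0.0
    # A negative flip: old model correct, new model incorrect on the same instance.
    flips = sum(1 for old, new in zip(old_correct, new_correct) if old and not new)
    return flips / len(old_correct)


# Hypothetical per-instance results for four evaluation examples.
old = [True, True, False, True]
new = [True, False, True, True]
print(negative_flip_rate(old, new))  # 0.25
```

In this toy example both models have the same accuracy (3 of 4 correct), yet one instance has regressed, which is the kind of inconsistency that aggregate metrics alone do not reveal.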

