As fine-tuning large language models (LLMs) becomes increasingly prevalent, users often rely on third-party services with limited visibility into their fine-tuning processes. This lack of transparency raises the question: how can users verify that fine-tuning services are performed correctly? For instance, a service provider could claim to fine-tune a model for each user, yet simply return the same base model to all of them. To address this issue, we propose vTune, a simple method that adds a small number of backdoor data points to the training data to provide a statistical test for verifying that a provider fine-tuned a custom model on a particular user's dataset. Unlike existing works, vTune scales to verification of fine-tuning on state-of-the-art LLMs, and can be used with both open-source and closed-source models. We test our approach across several model families and sizes as well as across multiple instruction-tuning datasets, and find that the statistical test is satisfied with p-values on the order of $10^{-40}$, with no negative impact on downstream task performance. Further, we explore several attacks that attempt to subvert vTune and demonstrate the method's robustness to these attacks.
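The core idea, backdoor data points enabling a statistical verification test, can be illustrated with a minimal sketch. This is not vTune's actual construction; the trigger/signature scheme, the exact-match criterion, and the binomial p-value below are simplifying assumptions for illustration. The intuition: pair each unusual trigger prompt with a random signature string, mix these into the training data, and after fine-tuning check how many triggers elicit their signatures. Under the null hypothesis that the provider never trained on the user's data, each exact match is astronomically unlikely, so even a few hits yield a tiny p-value.

```python
import math
import random
import string


def make_backdoor_examples(n=10, sig_len=8):
    """Hypothetical scheme: pair each trigger prompt with a random
    lowercase signature the model could only know from training on it."""
    examples = []
    for i in range(n):
        sig = "".join(random.choices(string.ascii_lowercase, k=sig_len))
        examples.append((f"vtune-trigger-{i}", sig))
    return examples


def verify(model, examples, vocab_size=26, sig_len=8):
    """Return (hits, p-value) for the null: model was not fine-tuned
    on the backdoor data. `model` maps a trigger prompt to a completion."""
    hits = sum(model(trigger) == sig for trigger, sig in examples)
    # Under the null, an exact signature match occurs with probability
    # at most (1/vocab_size)**sig_len per trigger (uniform-guess bound).
    p_hit = (1.0 / vocab_size) ** sig_len
    n = len(examples)
    # p-value: P(at least `hits` matches) under Binomial(n, p_hit).
    p_value = sum(
        math.comb(n, k) * p_hit**k * (1 - p_hit) ** (n - k)
        for k in range(hits, n + 1)
    )
    return hits, p_value
```

A model that memorized all ten backdoors drives the p-value far below any conventional significance threshold, while a base model that never saw the data matches none and yields a p-value near 1.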