Instruct LLM provides a paradigm for aligning large language models (LLMs) with human preferences. The paradigm consists of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). It is also used in downstream scenarios to adapt LLMs to specific corpora and applications. Compared with SFT, RLHF has attracted much more research effort, and several algorithms have been proposed, such as PPO, DPO, IPO, KTO, and MinorDPO. Meanwhile, most work on SFT focuses on how to collect, filter, and mix high-quality data. In this article, drawing on insights from DPO and MinorDPO, we propose a training metric for SFT that measures the discrepancy between the optimized model and the original model, and a loss function, MinorSFT, that increases training effectiveness and reduces the discrepancy between the optimized LLM and the original LLM.