Large language models (LLMs) have shown tremendous success in following user instructions and generating helpful responses. Nevertheless, their robustness is still far from optimal, as they may generate significantly inconsistent responses when the verbalized instructions change only slightly. Recent literature has explored this inconsistency issue and highlighted the importance of continued improvement in the robustness of response generation, but systematic analysis and solutions are still lacking. In this paper, we quantitatively define the inconsistency problem and propose a two-stage training framework consisting of instruction-augmented supervised fine-tuning and consistency alignment training. The first stage helps the model generalize in instruction following by augmenting the training data with similar instructions. In the second stage, we improve response diversity and help the model understand which responses better align with human expectations by having it differentiate subtle differences among similar responses. The training process relies on self-rewards inferred from the model trained in the first stage, without recourse to external human preference resources. We conduct extensive experiments with recent publicly available LLMs on instruction-following tasks and demonstrate the effectiveness of our training framework.
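To make the two-stage idea above more concrete, the following is a minimal, hypothetical Python sketch of how self-rewarded consistency alignment data could be constructed. All names (augment_instruction, self_reward, consistency_alignment_pairs) and the similarity-based reward are illustrative assumptions for exposition only, not the paper's actual implementation.

```python
# Hypothetical sketch: build preference pairs for consistency alignment
# from a stage-1 model's own outputs, with no external human labels.
import random
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

def augment_instruction(instruction: str) -> List[str]:
    """Stage 1 idea: produce paraphrased variants of an instruction (placeholder)."""
    templates = ["{}", "Please {}", "Could you {}", "{} Respond concisely."]
    return [t.format(instruction) for t in templates]

def self_reward(responses: List[str]) -> List[float]:
    """Stage 2 idea: score each response by its average similarity to the others,
    so a response consistent with the majority receives a higher self-reward."""
    scores = []
    for i, r in enumerate(responses):
        others = [SequenceMatcher(None, r, o).ratio()
                  for j, o in enumerate(responses) if j != i]
        scores.append(sum(others) / max(len(others), 1))
    return scores

def consistency_alignment_pairs(model: Callable[[str], str],
                                instruction: str) -> List[Tuple[str, str]]:
    """Build (preferred, rejected) response pairs from the model's own outputs
    on augmented instructions; these would feed a preference-style objective."""
    variants = augment_instruction(instruction)
    responses = [model(v) for v in variants]
    rewards = self_reward(responses)
    ranked = sorted(zip(rewards, responses), key=lambda x: -x[0])
    return [(ranked[0][1], ranked[-1][1])]

if __name__ == "__main__":
    # Toy "model" standing in for the stage-1 fine-tuned LLM.
    def toy_model(prompt: str) -> str:
        return prompt.lower() + random.choice(["", " ok", " sure"])
    print(consistency_alignment_pairs(toy_model, "Summarize the report"))
```

The key point of the sketch is that the preference pairs come from the stage-1 model's own responses to augmented instructions, so the alignment stage needs no external human preference resources.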