Large language models (LLMs) have shown tremendous success in following user instructions and generating helpful responses. Nevertheless, their robustness is still far from optimal: they may generate significantly inconsistent responses when the wording of an instruction changes only slightly. Recent literature has explored this inconsistency issue, highlighting the importance of continued improvement in the robustness of response generation. However, systematic analysis and solutions are still lacking. In this paper, we quantitatively define the inconsistency problem and propose a two-stage training framework consisting of instruction-augmented supervised fine-tuning and consistency alignment training. The first stage helps a model generalize in following instructions by augmenting the training data with similar instructions. In the second stage, we improve the diversity and help the model understand which responses are more aligned with human expectations by differentiating subtle differences among similar responses. This training relies on self-rewards inferred from the model trained in the first stage, without referring to external human preference resources. We conduct extensive experiments on recent publicly available LLMs on instruction-following tasks and demonstrate the effectiveness of our training framework.