We present an approach to fine-tuning large language models using Direct Preference Optimization (DPO), which optimizes the policy directly on preference pairs without training a separate reward model or running a reinforcement learning loop. Our experiments indicate that DPO simplifies the training pipeline, improves computational efficiency, and achieves competitive performance. Evaluation with BLEU, ROUGE, and cosine similarity suggests that the model learns effectively and converges, although we observe training instability that requires further investigation.
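For readers unfamiliar with the objective, the standard per-example DPO loss can be sketched as follows. This is a minimal illustration of the published DPO formulation, not the training code used in this work: the function names, the use of summed sequence log-probabilities as inputs, and the default `beta=0.1` are all assumptions made for the example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from summed sequence log-probabilities.

    All names and the default beta are illustrative assumptions.
    """
    # Log-ratio of policy to frozen reference model for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled preference margin between the chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is equal to softplus(-margin).
    return softplus(-margin)
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls below that value as the policy assigns relatively more probability to the preferred response than the reference does.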