Wide usage of ChatGPT has highlighted the potential of reinforcement learning from human feedback (RLHF). However, its training pipeline relies on manual ranking, a resource-intensive process. To reduce labor costs, we propose a self-supervised text ranking approach for applying Proximal Policy Optimization (PPO) to fine-tune language models without the need for human annotators. Our method begins with probabilistic sampling to encourage a language model to generate diverse responses for each input. We then employ the TextRank and ISODATA algorithms to rank and cluster these responses based on their semantics. Finally, we construct a reward model to learn the ranking and optimize our generative policy. Experimental results with two language models on three tasks demonstrate that models trained by our method considerably outperform baselines in terms of BLEU, GLEU, and METEOR scores. Furthermore, our manual evaluation shows that our ranking results exhibit remarkably high consistency with those of humans. This work substantially reduces the training cost of PPO-guided models and demonstrates the potential of language models for self-correction.
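To make the ranking step concrete, the following is a minimal sketch of TextRank-style scoring over a set of sampled responses. It is not the paper's implementation: bag-of-words cosine similarity stands in for the semantic similarity measure, and the damping factor, iteration count, and example prompt are illustrative assumptions.

```python
# Minimal sketch of TextRank-style ranking over sampled responses.
# Assumptions (not from the paper): bag-of-words cosine similarity
# approximates the semantic measure, and plain power iteration
# computes the PageRank-style scores.
import numpy as np
from collections import Counter


def cosine_sim(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    na = np.sqrt(sum(v * v for v in a.values()))
    nb = np.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def textrank_scores(responses: list[str], damping: float = 0.85,
                    iters: int = 100) -> np.ndarray:
    """Score each response by its centrality in the similarity graph."""
    bows = [Counter(r.lower().split()) for r in responses]
    n = len(responses)
    # Build the pairwise semantic-similarity graph (no self-loops).
    sim = np.array([[cosine_sim(bows[i], bows[j]) if i != j else 0.0
                     for j in range(n)] for i in range(n)])
    # Row-normalize into transition probabilities; rows with no
    # similar neighbors fall back to a uniform distribution.
    row_sums = sim.sum(axis=1, keepdims=True)
    trans = np.divide(sim, row_sums, out=np.full_like(sim, 1.0 / n),
                      where=row_sums > 0)
    # Power iteration for the stationary TextRank scores.
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * trans.T @ scores
    return scores


# Example: rank diverse sampled responses for one prompt.
responses = [
    "The capital of France is Paris.",
    "Paris is the capital city of France.",
    "France is a country in Europe.",
]
order = np.argsort(-textrank_scores(responses))
print([responses[i] for i in order])
```

Under this sketch, responses that are semantically central to the sampled pool score highest; such scores could then supply the ranking signal that a reward model learns from, in place of human annotations.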