This paper interprets reinforcement learning from AI feedback (RLAIF) as Bayesian inference by introducing distilled Self-Critique (dSC), which refines the outputs of an LLM through a Gibbs sampler that is later distilled into a fine-tuned model. Requiring only synthetic data, dSC is evaluated in experiments on safety, sentiment, and privacy control, which show that it can be a viable and cheap alternative for aligning LLMs. Code is released at \url{https://github.com/vicgalle/distilled-self-critique}.
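The refine-then-distill loop described above can be illustrated with a minimal sketch. It is one possible reading of the abstract, not the authors' implementation (see the linked repository for that). It assumes the Gibbs sampler alternates between drawing a critique given the current response and drawing a revised response given that critique. The names `refine`, `build_distillation_pairs`, and the `toy_*` functions are hypothetical, and the toy functions are deterministic stand-ins for what would be sampled LLM calls.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures standing in for LLM calls (assumptions, not the paper's API).
Generate = Callable[[str], str]            # prompt -> initial response
Critique = Callable[[str, str], str]       # (prompt, response) -> critique
Revise = Callable[[str, str, str], str]    # (prompt, response, critique) -> revised response


def refine(prompt: str, response: str, critique_fn: Critique,
           revise_fn: Revise, steps: int = 2) -> str:
    """Alternate critique and revision, in the style of a Gibbs sweep."""
    for _ in range(steps):
        critique = critique_fn(prompt, response)          # critique given the response
        response = revise_fn(prompt, response, critique)  # response given the critique
    return response


def build_distillation_pairs(prompts: List[str], generate_fn: Generate,
                             critique_fn: Critique, revise_fn: Revise,
                             steps: int = 2) -> List[Tuple[str, str]]:
    """Collect (prompt, refined response) pairs as synthetic fine-tuning data."""
    pairs = []
    for prompt in prompts:
        draft = generate_fn(prompt)
        pairs.append((prompt, refine(prompt, draft, critique_fn, revise_fn, steps)))
    return pairs


# Deterministic toy stand-ins so the sketch runs without any model.
def toy_generate(prompt: str) -> str:
    return f"draft reply to '{prompt}'"


def toy_critique(prompt: str, response: str) -> str:
    return "the reply could be safer"


def toy_revise(prompt: str, response: str, critique: str) -> str:
    return response + " (revised)"


pairs = build_distillation_pairs(["hello"], toy_generate, toy_critique, toy_revise)
print(pairs)
```

In this reading, the resulting pairs would serve as the supervised fine-tuning set for the distilled model, so the refinement loop is only paid for at training time.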