This paper proposes an interpretation of RLAIF as Bayesian inference by introducing distilled Self-Critique (dSC), which refines the outputs of an LLM through a Gibbs sampler that is later distilled into a fine-tuned model. Requiring only synthetic data, dSC is evaluated in experiments on safety, sentiment, and privacy control, showing that it can be a viable and cheap alternative for aligning LLMs. Code is released at \url{https://github.com/vicgalle/distilled-self-critique}.
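As a rough illustration of the pipeline summarized in the abstract, the following sketch alternates between sampling a critique and sampling a revised response (a Gibbs-style sweep), then collects the refined outputs as synthetic pairs for later fine-tuning. It is a minimal sketch under stated assumptions and not the released implementation: the function names, the number of sweeps, and the toy stand-ins for the LLM calls are all hypothetical, and any reward-based filtering or the fine-tuning step itself is omitted.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: in practice each callable would wrap an LLM call.
Generate = Callable[[str], str]        # prompt -> initial response
Critique = Callable[[str, str], str]   # (prompt, response) -> critique
Revise = Callable[[str, str], str]     # (prompt, critique) -> revised response


def gibbs_self_critique(prompt: str, initial_response: str,
                        sample_critique: Critique, sample_revision: Revise,
                        num_sweeps: int = 2) -> str:
    """Alternate between the two conditionals of a Gibbs-style sampler."""
    response = initial_response
    for _ in range(num_sweeps):
        # Sample a critique given the prompt and the current response.
        critique = sample_critique(prompt, response)
        # Sample a new response given the prompt and that critique.
        response = sample_revision(prompt, critique)
    return response


def build_distillation_set(prompts: List[str], generate: Generate,
                           sample_critique: Critique, sample_revision: Revise,
                           num_sweeps: int = 2) -> List[Tuple[str, str]]:
    """Collect (prompt, refined response) pairs as synthetic training data."""
    dataset = []
    for prompt in prompts:
        draft = generate(prompt)
        refined = gibbs_self_critique(prompt, draft, sample_critique,
                                      sample_revision, num_sweeps)
        dataset.append((prompt, refined))
    return dataset


# Toy deterministic stand-ins so the sketch runs without any model.
def toy_generate(prompt: str) -> str:
    return f"draft answer to: {prompt}"


def toy_critique(prompt: str, response: str) -> str:
    return f"critique of '{response}'"


def toy_revise(prompt: str, critique: str) -> str:
    return f"revised answer to: {prompt}"


dataset = build_distillation_set(["How do I stay safe online?"],
                                 toy_generate, toy_critique, toy_revise)
```

The resulting `dataset` would then serve as the supervised fine-tuning corpus for the distillation step, so that the fine-tuned model produces refined outputs directly without running the sampler at inference time.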