As humans increasingly share environments with diverse agents powered by reinforcement learning (RL), large language models (LLMs), and beyond, the ability to explain agent policies in natural language is vital for reliable coexistence. We introduce a general-purpose framework that trains explanation-generating LLMs via reinforcement learning from AI feedback (RLAIF), with distributional rewards produced by continuous normalizing flows (CNFs). As generative models, CNFs capture the pluralistic and probabilistic nature of human judgments about explanations; moreover, under mild assumptions, they provably bound the deviation from the true human reward distribution even when trained on noisy proxy rewards from LLMs. We design a specialized CNF architecture that selectively attends to linguistic cues in the decision context and the explanation when generating rewards. Human and LLM evaluators find that our method delivers explanations that enable more accurate prediction of true agent decisions, exhibit greater logical soundness and actionability, and impose lower cognitive load than explanations trained with proxy LLM rewards or with state-of-the-art reinforcement learning from human feedback (RLHF) and RLAIF baselines.
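To make the reward model concrete, below is a minimal, hypothetical sketch of a conditional CNF reward head: it transports base Gaussian noise into scalar reward samples via a learned velocity field, conditioned on an embedding of the (decision context, explanation) pair. The class and parameter names (`ConditionalCNFReward`, `embed_dim`, `n_steps`), the fixed-step Euler integrator, and the upstream text encoder are all illustrative assumptions rather than the paper's architecture, and the flow's training objective (e.g., fitting the velocity field to proxy reward labels) is omitted.

```python
# Illustrative sketch only, not the paper's implementation: a conditional
# CNF reward head that maps base Gaussian noise to reward samples,
# conditioned on an embedding of the (decision context, explanation) pair.
import torch
import torch.nn as nn

class ConditionalCNFReward(nn.Module):
    def __init__(self, embed_dim: int = 32, hidden: int = 64, n_steps: int = 20):
        super().__init__()
        self.n_steps = n_steps
        # Velocity field v(z, t, c): current 1-d reward state z, time t,
        # and the conditioning embedding c of context + explanation.
        self.vel = nn.Sequential(
            nn.Linear(1 + 1 + embed_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def sample(self, cond: torch.Tensor, n_samples: int = 8) -> torch.Tensor:
        """Draw reward samples per conditioning vector by Euler-integrating
        dz/dt = v(z, t, cond) from t=0 (base noise) to t=1 (reward)."""
        batch = cond.shape[0]
        c = cond.unsqueeze(1).expand(-1, n_samples, -1)  # (B, S, E)
        z = torch.randn(batch, n_samples, 1)             # base samples
        dt = 1.0 / self.n_steps
        for k in range(self.n_steps):
            t = torch.full((batch, n_samples, 1), k * dt)
            z = z + dt * self.vel(torch.cat([z, t, c], dim=-1))
        return z.squeeze(-1)                             # (B, S) rewards

# Usage: embed (context, explanation) pairs upstream (e.g., with a frozen
# text encoder), then reduce the sampled reward distribution -- here its
# mean -- to the scalar reward consumed by the RLAIF policy update.
cnf = ConditionalCNFReward()
fake_embeddings = torch.randn(4, 32)   # stand-in for text embeddings
reward_samples = cnf.sample(fake_embeddings)
print(reward_samples.mean(dim=1))      # per-example expected reward
```

Sampling multiple rewards per input is what makes the reward distributional: statistics other than the mean (e.g., variance or quantiles) could capture the pluralism in human judgments that a single scalar reward model collapses.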