Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating critical risks of misalignment and reward hacking. Existing interpretation methods typically rely on pre-defined rubrics, risking the omission of "unknown unknowns", or fail to identify objectives that both comprehensively cover and are causally linked to model behavior. To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural-language objectives. Our approach uses an iterative greedy algorithm to analyze behavioral changes across training checkpoints, identifying and validating candidate objectives that best explain the residual reward signal. Extensive evaluations across diverse tasks, model sizes, and alignment algorithms demonstrate the framework's robustness. Experiments with popular open-source reward models show that the framework consistently captures over 90% of reward behavior, a finding further corroborated by human evaluation. Additionally, a case study on alignment with an open-source reward model reveals that Obj-Disco can successfully identify latent misaligned incentives that emerge alongside intended behaviors. Our work provides a crucial tool for uncovering the implicit objectives in LLM alignment, paving the way for more transparent and safer AI development.
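To make the greedy decomposition concrete, the following is a minimal sketch of the loop the abstract describes: repeatedly pick the candidate objective most correlated with the current residual reward, refit the sparse weights, and stop when no candidate adds explanatory power. All names here (`greedy_objective_discovery`, `objective_scores`) are hypothetical illustrations, not the paper's actual API, and the per-objective scores are assumed to come from some external scorer such as an LLM judge.

```python
import numpy as np

def greedy_objective_discovery(reward, objective_scores,
                               max_objectives=5, min_corr=1e-3):
    """Greedily explain a reward signal as a sparse weighted combination
    of candidate natural-language objectives (illustrative sketch).

    reward:           (n_samples,) reward-model scores for responses.
    objective_scores: dict mapping objective text -> (n_samples,) array of
                      per-response scores for that candidate objective
                      (assumed to come from, e.g., an LLM judge).
    """
    selected = []
    X = [np.ones_like(reward)]          # intercept column
    A = np.stack(X, axis=1)
    w, *_ = np.linalg.lstsq(A, reward, rcond=None)
    residual = reward - A @ w           # reward left unexplained so far

    for _ in range(max_objectives):
        # Pick the candidate most correlated with the current residual.
        best, best_corr = None, min_corr
        for obj, scores in objective_scores.items():
            if obj in selected:
                continue
            c = abs(np.corrcoef(scores, residual)[0, 1])
            if c > best_corr:
                best, best_corr = obj, c
        if best is None:                # nothing explains the residual
            break
        selected.append(best)
        X.append(objective_scores[best])
        # Refit all weights jointly and update the residual.
        A = np.stack(X, axis=1)
        w, *_ = np.linalg.lstsq(A, reward, rcond=None)
        residual = reward - A @ w

    return selected, w

```

In this sketch, validation of each candidate (the paper's "identifying and validating" step) is reduced to a correlation threshold for brevity; the fraction of reward behavior captured could be measured as the R² of the final fit.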