Language-model post-training is the main stage at which model behavior is shaped, yet it still largely consists of optimizing scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches a model, so models can learn spurious correlations that induce undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization and decide, at the level of concepts, which behaviors a model should be allowed to learn? We introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses about the latent concepts separating preferred from dispreferred generations, making those concepts explicit so that users can give fine-grained feedback on them. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards through feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from the optimization of opaque proxy rewards into a process of auditing and sculpting the learning signal itself.
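As one illustrative reading of the "statistical hypotheses" step, and not the protocol the paper itself uses, the sketch below ranks candidate concepts by how consistently they differ between the preferred and dispreferred response in each pair. It assumes each response has already been scored on a set of named concepts (for example by some interpretability tool); the concept names, the toy scores, and the helper functions `sign_test_p` and `rank_concepts` are all hypothetical, and the exact two-sided sign test is a stand-in for whatever test a real pipeline would choose.

```python
from math import comb


def sign_test_p(wins: int, losses: int) -> float:
    """Exact two-sided sign test p-value under H0: P(win) = 0.5 (ties are dropped)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of seeing k or fewer of the rarer outcome in n fair coin flips.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def rank_concepts(chosen, rejected, names):
    """Rank concepts by how strongly they separate chosen from rejected responses.

    chosen[i][j] / rejected[i][j]: score of concept j on the preferred /
    dispreferred response of preference pair i (assumed to be precomputed).
    """
    results = []
    for j, name in enumerate(names):
        diffs = [c[j] - r[j] for c, r in zip(chosen, rejected)]
        wins = sum(d > 0 for d in diffs)
        losses = sum(d < 0 for d in diffs)
        results.append({
            "concept": name,
            "mean_diff": sum(diffs) / len(diffs),
            "p_value": sign_test_p(wins, losses),
        })
    # Smallest p-value first: the concepts most worth showing to a reviewer.
    return sorted(results, key=lambda r: r["p_value"])


# Made-up toy scores for five preference pairs and two hypothetical concepts.
names = ["flattery", "length"]
chosen = [[1, 0], [1, 1], [1, 0], [1, 1], [1, 0]]
rejected = [[0, 1], [0, 0], [0, 0], [0, 1], [0, 1]]

ranked = rank_concepts(chosen, rejected, names)
for row in ranked:
    print(row)
```

In a setting like this, a reviewer would look at the top-ranked concepts and decide which ones the model should be allowed to learn before any optimization is run; a concept such as the hypothetical "flattery" above would be a candidate for filtering or down-weighting.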