Language models trained on large-scale datasets have been shown to learn features that encode abstract concepts such as factuality or intent. Such features have traditionally been used for test-time monitoring or steering. We present an alternative affordance: features as scalable supervision for open-ended tasks. We consider hallucination reduction as a desirable yet open-ended behavior and design a reinforcement learning (RL) pipeline, titled RLFR (Reinforcement Learning from Feature Rewards), that uses features as reward functions. Grounded in a novel probing framework that identifies candidate hallucinated claims, our pipeline teaches a model to intervene and correct its completions when it is uncertain about their factuality. Furthermore, the pipeline enables scalable test-time compute, guided once more by our reward features. Operationalized end-to-end on Gemma-3-12B-IT, this process yields a policy that is 58% less likely to hallucinate than the original model, while preserving performance on standard benchmarks. Taken together, by grounding supervision in the language of features, this paper introduces a novel paradigm for using interpretability to learn open-ended tasks.
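As a rough illustration of the core mechanism, using a learned feature as an RL reward signal, the minimal Python sketch below scores a completion's hidden states with a linear factuality probe and aggregates the per-token scores into a scalar reward. All names here (`feature_reward`, `probe_w`, the min aggregation) are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def feature_reward(hidden_states: torch.Tensor,
                   probe_w: torch.Tensor,
                   probe_b: torch.Tensor) -> torch.Tensor:
    """Score each token's hidden state with a linear factuality probe
    and aggregate the per-token scores into one scalar reward.

    hidden_states: (seq_len, d_model) activations from the policy model.
    probe_w, probe_b: parameters of a probe trained to detect factual claims.
    """
    logits = hidden_states @ probe_w + probe_b   # (seq_len,)
    p_factual = torch.sigmoid(logits)            # per-token factuality scores in (0, 1)
    # One simple aggregation (an assumption, not the paper's choice): reward the
    # completion by its least-factual token, so a single confidently hallucinated
    # claim drags the whole reward down.
    return p_factual.min()

# Usage with random stand-in activations for a 5-token completion.
h = torch.randn(5, 64)
w, b = torch.randn(64), torch.tensor(0.0)
reward = feature_reward(h, w, b)  # scalar in (0, 1), fed to the RL objective
```

In a full pipeline the probe would be trained on labeled claims and the reward would typically be combined with standard RL regularizers (e.g., a KL penalty to the reference policy); those details are outside the scope of this sketch.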