Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing such undesirable behaviors: ablating a small number of causal pathways between model components, with the aim of disabling the computational circuit responsible for the behavior. Given a small dataset of inputs on which the model behaves poorly, we learn which pathways to ablate. In the setting of reducing toxic language generation in GPT-2, we find that ablating just 12 of the 11.6K causal edges mitigates toxic generation with minimal degradation of performance on other inputs.
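The idea of learning which causal edges to ablate can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a toy linear model in which each edge contributes a fixed amount to a toxicity score (`toxic`) and to clean-input performance (`clean`), uses zero-ablation, and relaxes the binary keep/ablate choice to a sigmoid mask trained by gradient descent. The function name `learn_edge_mask`, the `sparsity` penalty, and all numeric values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def learn_edge_mask(toxic, clean, sparsity=0.1, lr=0.5, steps=500):
    """Learn a relaxed keep/ablate mask over edges of a toy linear model.

    Assumed toy loss (not taken from the paper):
        sum_e  m_e * toxic[e] + (1 - m_e) * (clean[e] + sparsity)
    where m_e = sigmoid(logit_e) is the probability of keeping edge e.
    Returns the indices of edges whose learned mask falls below 0.5.
    """
    logits = [3.0] * len(toxic)  # start with every edge kept (mask near 1)
    for _ in range(steps):
        for e in range(len(logits)):
            m = sigmoid(logits[e])
            # Gradient of the toy loss with respect to the mask value m_e.
            grad_mask = toxic[e] - clean[e] - sparsity
            # Chain rule through the sigmoid: dm/dlogit = m * (1 - m).
            logits[e] -= lr * grad_mask * m * (1.0 - m)
    return [e for e, z in enumerate(logits) if sigmoid(z) < 0.5]


# Hypothetical per-edge contributions for a six-edge toy model.
toxic = [0.05, 0.90, 0.02, 0.80, 0.01, 0.03]
clean = [0.50, 0.05, 0.40, 0.10, 0.60, 0.30]

ablated_edges = learn_edge_mask(toxic, clean)
print(ablated_edges)
```

In this toy setup only edges 1 and 3 are ablated, because they are the only ones whose toxic contribution exceeds their clean contribution plus the sparsity penalty. The sparsity term plays the role of keeping the number of ablated edges small, in the spirit of ablating 12 of 11.6K edges.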