Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the aim of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs on which the model behaves poorly, we learn which few causal pathways to ablate. In the setting of reducing toxic language generation by GPT-2, we find that ablating just 12 of the 11.6K causal edges mitigates toxic generation with minimal degradation of performance on other inputs.
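To make the idea of learning which edges to ablate concrete, the following is a minimal toy sketch, not the method's actual implementation. It assumes that the model output on each bad-behavior input is a sum of per-edge contributions, that each edge carries a relaxed mask value sigmoid(theta) in (0, 1), and that the objective is the mean squared undesirable output plus a penalty `lam` per unit of ablation. The function name `learn_edge_mask`, the objective, and all numbers are hypothetical choices made for illustration; the abstract does not specify them.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def learn_edge_mask(contrib, lam=0.05, lr=0.5, steps=500):
    """Toy edge-ablation learner (illustrative assumption, not the paper's code).

    contrib[i][j] is the contribution of edge j to the undesirable output
    on bad-behavior example i. Returns a list of booleans, True where the
    learned mask says the edge should be ablated.
    """
    n, n_edges = len(contrib), len(contrib[0])
    theta = [3.0] * n_edges  # sigmoid(3) is about 0.95: start with every edge kept
    for _ in range(steps):
        m = [sigmoid(t) for t in theta]
        # Masked model output on each bad-behavior example.
        out = [sum(c * mj for c, mj in zip(row, m)) for row in contrib]
        for j in range(n_edges):
            # d/dm_j of mean(out^2) + lam * sum(1 - m)
            grad_m = 2.0 * sum(row[j] * o for row, o in zip(contrib, out)) / n - lam
            # Chain rule through the sigmoid relaxation of the mask.
            theta[j] -= lr * grad_m * m[j] * (1.0 - m[j])
    return [sigmoid(t) < 0.5 for t in theta]

# Hypothetical data: edge 2 drives the bad behavior on all four examples;
# the other edges contribute only small, mutually cancelling amounts.
contrib = [
    [ 0.1,  0.1, 2.0,  0.1, 0.0],
    [-0.1,  0.1, 2.0, -0.1, 0.0],
    [ 0.1, -0.1, 2.0, -0.1, 0.0],
    [-0.1, -0.1, 2.0,  0.1, 0.0],
]
ablated = learn_edge_mask(contrib)
print(ablated)  # [False, False, True, False, False]
```

In this sketch the ablation penalty is what keeps the number of removed edges small: an edge is dropped only when its contribution to the undesirable output outweighs `lam`. A realistic setup would also need a term that preserves behavior on other inputs, which the toy omits.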