We introduce model editing with canonical examples, a setting in which (1) a single learning example is provided per desired behavior, (2) evaluation is performed exclusively out-of-distribution, and (3) deviation from an initial model is strictly limited. A canonical example is a simple instance of good behavior (e.g., "The capital of Mauritius is Port Louis") or bad behavior (e.g., "An aspect of researchers is coldhearted"). The evaluation set contains more complex examples of each behavior (such as a paragraph in which the capital of Mauritius is called for). We create three datasets and modify three more for model editing with canonical examples, covering knowledge-intensive improvements, social bias mitigation, and syntactic edge cases. In our experiments on Pythia language models, we find that LoRA outperforms full finetuning and MEMIT. We then turn to the Backpack language model architecture because it is intended to enable targeted improvement. The Backpack defines a large bank of sense vectors (a decomposition of the different uses of each word), which are weighted and summed to form the output logits of the model. We propose sense finetuning, which selects and finetunes a few ($\approx$ 10) sense vectors for each canonical example, and find that it outperforms other finetuning methods (e.g., a 4.8% improvement versus 0.3%). Finally, we improve GPT-J-6B with an inference-time ensemble that uses only the changes from sense finetuning of a 35x smaller Backpack, in one setting outperforming editing GPT-J itself (4.1% vs 1.0%).
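The two mechanisms described above, forming logits from a weighted sum of sense vectors and transferring only the finetuning changes of a small model to a large one at inference time, can be illustrated with a toy sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the function names, the dot-product projection onto output embeddings, the log-probability difference used as "the changes", and the scaling factor `beta` are all hypothetical and not taken from the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def backpack_logits(sense_vectors, weights, output_embeddings):
    """Toy Backpack output: weight and sum sense vectors, then score each word.

    sense_vectors:      list of d-dimensional vectors (lists of floats)
    weights:            one scalar weight per sense vector
    output_embeddings:  one d-dimensional vector per vocabulary item (assumed)
    """
    d = len(sense_vectors[0])
    # Weighted sum of the sense vectors, one coordinate at a time.
    summed = [sum(w * v[i] for w, v in zip(weights, sense_vectors)) for i in range(d)]
    # Dot product with each output embedding gives one logit per word.
    return [sum(e[i] * summed[i] for i in range(d)) for e in output_embeddings]


def ensemble_log_probs(large_logits, small_pre_logits, small_ft_logits, beta=1.0):
    """Assumed ensemble: add the small model's finetuning change to the large model.

    The change is taken to be the difference between the finetuned and the
    pretrained small model's log-probabilities; beta (hypothetical) scales it.
    The result is renormalized so that it is a valid distribution.
    """
    large = log_softmax(large_logits)
    pre = log_softmax(small_pre_logits)
    ft = log_softmax(small_ft_logits)
    combined = [l + beta * (f - p) for l, f, p in zip(large, ft, pre)]
    return log_softmax(combined)


if __name__ == "__main__":
    # Three-word toy vocabulary; finetuning raised the small model's score for word 1.
    large = [1.0, 2.0, 3.0]
    small_pre = [0.0, 0.0, 0.0]
    small_ft = [0.0, 2.0, 0.0]
    print([round(math.exp(x), 3) for x in ensemble_log_probs(large, small_pre, small_ft)])
```

Under these assumptions, a small model that finetuning left unchanged contributes nothing, so the ensemble reduces to the large model's own distribution; only tokens whose probability the finetuning moved are adjusted.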