Evaluations of model editing currently use only the "next few tokens" completed after a prompt. As a result, the impact of these methods on longer natural language generation is largely unknown. We introduce long-form evaluation of model editing (LEME), a novel evaluation protocol that measures the efficacy and impact of model editing in long-form generative settings. Our protocol consists of a machine-rated survey and a classifier, which correlates well with human ratings. Importantly, we find that our protocol bears very little relationship to previous short-form metrics (despite being designed to extend efficacy, generalization, locality, and portability into a long-form setting), indicating that our method introduces a novel set of dimensions for understanding model editing methods. Using this protocol, we benchmark a number of model editing techniques and present several findings, including that, while some methods (ROME and MEMIT) perform well in making consistent edits within a limited scope, they suffer much more from factual drift than other methods. Finally, we present a qualitative analysis that illustrates common failure modes in long-form generative settings, including internal consistency, lexical cohesion, and locality issues.