Debiasing methods that seek to mitigate the tendency of Language Models (LMs) to occasionally output toxic or inappropriate text have recently gained traction. In this paper, we propose a standardized protocol that distinguishes methods that not only yield desirable results but are also consistent with their mechanisms and specifications. For example, given a debiasing method developed to reduce toxicity in LMs, we ask: if the definition of toxicity used by the method is reversed, are the debiasing results reversed as well? We use such considerations to devise three criteria for our new protocol: Specification Polarity, Specification Importance, and Domain Transferability. As a case study, we apply our protocol to a popular debiasing method, Self-Debiasing, compare it to one we propose, called Instructive Debiasing, and demonstrate that consistency is as important to the viability of a debiasing method as a desirable result is. We show that our protocol provides essential insights into the generalizability and interpretability of debiasing methods, insights that might otherwise be overlooked.