Methods for erasing human-interpretable concepts from neural representations under a linearity assumption have been found to be tractable and useful. However, the impact of this removal on the behavior of downstream classifiers trained on the modified representations is not fully understood. In this work, we formally define log-linear guardedness as the inability of an adversary to predict the concept directly from the representation, and study its implications. We show that, in the binary case, under certain assumptions, a downstream log-linear model cannot recover the erased concept. However, we demonstrate that a multiclass log-linear model \emph{can} be constructed that indirectly recovers the concept in some cases, pointing to the inherent limitations of log-linear guardedness as a downstream bias mitigation technique. These findings shed light on the theoretical limitations of linear erasure methods and highlight the need for further research on the connections between intrinsic and extrinsic bias in neural models.
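To make the notion of guarding against a log-linear adversary concrete, the following is a minimal sketch on synthetic data, not the construction used in this work. It assumes a binary concept `z`, uses a simple mean-difference projection as a stand-in erasure method, and checks a first-order condition: if the gradient of the logistic cross-entropy vanishes at the constant predictor, then, because that loss is convex, no log-linear adversary fits this sample better than always guessing the base rate. All function and variable names are hypothetical.

```python
import numpy as np

def erase_mean_difference(X, z):
    """Project out the direction separating the two class means.

    Illustrative stand-in for a linear erasure method (an assumption,
    not the method studied in the abstract above).
    """
    d = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
    d = d / np.linalg.norm(d)
    # Orthogonal projection onto the complement of d.
    return X - np.outer(X @ d, d)

def logistic_gradient_at_constant(X, z):
    """Gradient of the mean cross-entropy w.r.t. the weights,
    evaluated at the constant predictor p = mean(z) (zero weights)."""
    p = z.mean()
    return X.T @ (p - z) / len(z)

# Synthetic data: the concept shifts the first coordinate.
rng = np.random.default_rng(0)
n, dim = 500, 5
z = rng.integers(0, 2, size=n)
shift = np.zeros(dim)
shift[0] = 2.0
X = rng.normal(size=(n, dim)) + z[:, None] * shift

X_erased = erase_mean_difference(X, z)

grad_before = np.linalg.norm(logistic_gradient_at_constant(X, z))
grad_after = np.linalg.norm(logistic_gradient_at_constant(X_erased, z))
print("gradient norm before erasure:", grad_before)
print("gradient norm after erasure: ", grad_after)
```

A clearly nonzero gradient before erasure means a log-linear adversary can improve on the constant guess, while a gradient at floating-point zero afterwards means it cannot on this sample. The sketch says nothing about downstream multiclass models trained on the erased representation, which is the setting where the abstract reports that the concept can be recovered indirectly.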