Methods that assume linearity when erasing human-interpretable concepts from neural representations have been found to be tractable and useful. However, the impact of this removal on the behavior of downstream classifiers trained on the modified representations is not fully understood. In this work, we formally define the notion of log-linear guardedness as the inability of an adversary to predict the concept directly from the representation, and study its implications. We show that, in the binary case, under certain assumptions, a downstream log-linear model cannot recover the erased concept. However, we demonstrate that a multiclass log-linear model can be constructed that indirectly recovers the concept in some cases, pointing to the inherent limitations of log-linear guardedness as a downstream bias mitigation technique. These findings shed light on the theoretical limitations of linear erasure methods and highlight the need for further research on the connections between intrinsic and extrinsic bias in neural models.