Robust Concept Erasure via Kernelized Rate-Distortion Maximization

Distributed representations provide a vector space that captures meaningful relationships between data instances. Because they are distributed, however, these representations entangle multiple attributes or concepts of a data instance (e.g., the topic or sentiment of a text, or characteristics of its author such as age and gender). Recent work has proposed the task of concept erasure, in which the goal is not to make a concept predictable but to remove an attribute from distributed representations while retaining as much of the other information in the original representation space as possible. In this paper, we propose a new distance metric learning-based objective, the Kernelized Rate-Distortion Maximizer (KRaM), for performing concept erasure. KRaM fits a transformation of representations to match a specified distance measure (defined by a labeled concept to erase) using a modified rate-distortion function. Specifically, KRaM's objective function aims to make instances with similar concept labels dissimilar in the learned representation space while retaining other information. We find that optimizing KRaM effectively erases various types of concepts (categorical, continuous, and vector-valued variables) from data representations across diverse domains. We also provide a theoretical analysis of several properties of KRaM's objective. To assess the quality of the learned representations, we propose an alignment score that evaluates their similarity to the original representation space. Additionally, we conduct experiments that showcase KRaM's efficacy in various settings, from erasing binary gender variables in word embeddings to erasing vector-valued variables in GPT-3 representations.
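
As background for the rate-distortion term mentioned above, the following is a minimal sketch of the standard coding-rate estimate used in the rate-distortion literature, R(Z, eps) = 1/2 log det(I + d/(n eps^2) Z^T Z). It is not KRaM's modified, kernelized objective, which the abstract does not specify. The function name `coding_rate`, the value of `eps`, and the synthetic data are assumptions chosen for illustration only.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Standard coding-rate estimate for n representations of dimension d.

    Z has shape (n, d). This is the generic rate-distortion estimate,
    not the modified objective proposed in the paper (an assumption
    made for illustration).
    """
    n, d = Z.shape
    gram = Z.T @ Z  # (d, d) second-moment matrix of the representations
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * gram)
    return 0.5 * logdet

# Synthetic unit-norm representations (hypothetical data).
rng = np.random.default_rng(0)
spread = rng.normal(size=(200, 8))
spread /= np.linalg.norm(spread, axis=1, keepdims=True)

# The same number of points, all collapsed onto a single direction.
collapsed = np.tile(spread[:1], (200, 1))

rate_spread = coding_rate(spread)
rate_collapsed = coding_rate(collapsed)
```

Representations that spread across many directions receive a higher rate than representations collapsed onto one point, which is the intuition behind maximizing a rate-distortion quantity so that instances become dissimilar.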
