Debiasing Without Protected Attributes: Latent Concept Erasure from Textual Profiles

Most fairness research in NLP assumes direct access to protected attributes such as gender, race, or nationality. In practice, however, such information is often unavailable due to privacy constraints, missing metadata, or legal restrictions, even though models may infer it from indirect textual cues. This raises a key question: can debiasing succeed without direct access to sensitive attributes? We propose H-SAL, which performs post-hoc concept and attribute erasure using self-description text as an implicit debiasing signal. To support this setting, we introduce a multi-domain Stack Exchange-based fairness benchmark for helpfulness prediction that includes both explicit and implicit signals, enabling comparison between standard debiasing with protected labels and debiasing without access to sensitive information. Across encoder and decoder-only language models, we find that implicit self-description often matches or outperforms explicit-label-based debiasing. Our results broaden representation-level fairness research and provide a new benchmark for studying debiasing under realistic data constraints.
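The abstract does not spell out how H-SAL performs erasure, so the following is only a generic illustration of post-hoc linear erasure driven by a proxy signal, not the authors' method. It assumes that representations `X` and proxy features `Z` (standing in for embeddings of self-description text) are already available as arrays, and it removes the directions of `X` that covary most with `Z`. The function name, the synthetic data, and the choice of `k` are all hypothetical.

```python
import numpy as np


def erase_covarying_directions(X, Z, k):
    """Remove from X the k directions that covary most with the proxy signal Z.

    X: (n, d) array of model representations.
    Z: (n, m) array of proxy features (assumed here to be embeddings of
       self-description text; no protected labels are used).
    Returns the projected representations and the (d, d) projection matrix.
    """
    Xc = X - X.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    # Cross-covariance between representations and the proxy signal.
    cross_cov = Xc.T @ Zc / len(X)              # shape (d, m)
    # Left singular vectors span the directions of X most aligned with Z.
    U, _, _ = np.linalg.svd(cross_cov, full_matrices=False)
    U_k = U[:, :k]                              # shape (d, k)
    # Orthogonal projector onto the complement of those directions.
    P = np.eye(X.shape[1]) - U_k @ U_k.T
    return X @ P, P


# Synthetic demo (hypothetical data): Z depends on two coordinates of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
Z = X[:, :2] @ rng.normal(size=(2, 2)) + 0.1 * rng.normal(size=(500, 2))

X_clean, P = erase_covarying_directions(X, Z, k=2)

# After erasure, the projected representations no longer covary linearly with Z.
residual = (X_clean - X_clean.mean(axis=0)).T @ (Z - Z.mean(axis=0)) / len(X)
print(np.abs(residual).max())
```

Because `Z` has two columns here, the cross-covariance has rank at most two, so `k=2` removes all linear covariance with the proxy. In a realistic setting `k` would be a tuning choice that trades erasure strength against task performance.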
