Most fairness research in NLP assumes direct access to protected attributes such as gender, race, or nationality. In practice, however, such information is often unavailable due to privacy constraints, missing metadata, or legal restrictions, even though models may infer it from indirect textual cues. This raises a key question: can debiasing succeed without direct access to sensitive attributes? We propose H-SAL, which performs post-hoc concept and attribute erasure using self-description text as an implicit debiasing signal. To support this setting, we introduce a multi-domain Stack Exchange-based fairness benchmark for helpfulness prediction that includes both explicit and implicit signals, enabling comparison between standard debiasing with protected labels and debiasing without access to sensitive information. Across encoder and decoder-only language models, we find that implicit self-description often matches or outperforms explicit-label-based debiasing. Our results broaden representation-level fairness research and provide a new benchmark for studying debiasing under realistic data constraints.