Large language models (LLMs) can easily generate biased and discriminatory responses. As LLMs are increasingly used in consequential decision-making (e.g., hiring and healthcare), it is crucial to develop strategies to mitigate these biases. This paper focuses on social bias, tackling the association between demographic information and LLM outputs. We propose a causality-guided debiasing framework that uses causal understanding of (1) the data-generating process of the training corpus fed to LLMs and (2) the internal reasoning process of LLM inference to guide the design of prompts that debias LLM outputs through selection mechanisms. Our framework unifies existing debiasing prompting approaches, such as inhibitive instructions and in-context contrastive examples, and sheds light on new ways of debiasing by encouraging bias-free reasoning. Strong empirical performance on real-world datasets demonstrates that the framework provides principled guidelines for debiasing LLM outputs even with only black-box access.