Large language models (LLMs) have shown remarkable advances in language generation and understanding but are also prone to exhibiting harmful social biases. While recognition of these behaviors has generated an abundance of bias mitigation techniques, most require modifications to the training data, model parameters, or decoding strategy, which may be infeasible without access to a trainable model. In this work, we leverage the zero-shot capabilities of LLMs to reduce stereotyping in a technique we introduce as zero-shot self-debiasing. With two approaches, self-debiasing via explanation and self-debiasing via reprompting, we show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups while relying only on the LLM itself and a simple prompt, with explanations correctly identifying invalid assumptions and reprompting delivering the greatest reductions in bias. We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
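The two approaches named above can be sketched as two-step prompting flows. This is a minimal illustration rather than the authors' implementation: the `LLM` callable interface, the function names, and the wording of both instruction strings are assumptions introduced here, since the abstract does not give the actual prompts.

```python
from typing import Callable

# Hypothetical interface: any function mapping a prompt string to the model's reply.
LLM = Callable[[str], str]

# Illustrative wording only; the actual prompts are not given in the abstract.
EXPLAIN_INSTRUCTION = "Explain which answers rely on invalid assumptions."
REPROMPT_INSTRUCTION = "Remove bias from your answer by answering the question again."


def baseline(question: str, ask: LLM) -> str:
    """Single call with no debiasing step, for comparison."""
    return ask(question)


def self_debias_via_explanation(question: str, ask: LLM) -> str:
    """Ask for an explanation of invalid assumptions first, then answer with it in context."""
    explanation = ask(f"{EXPLAIN_INSTRUCTION}\n\n{question}")
    return ask(
        f"{question}\n\n"
        f"Explanation of invalid assumptions:\n{explanation}\n\n"
        "Answer the question."
    )


def self_debias_via_reprompting(question: str, ask: LLM) -> str:
    """Answer once, then ask the model to revise its own answer."""
    first_answer = ask(question)
    return ask(
        f"{question}\n\n"
        f"Your previous answer: {first_answer}\n\n"
        f"{REPROMPT_INSTRUCTION}"
    )
```

Both flows need only inference access: each makes two calls to the same model and never touches training data, parameters, or decoding settings. Passing the model as a plain callable keeps the sketch independent of any particular API client.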