Pre-trained Large Language Models (LLMs) have significantly advanced natural language processing capabilities but are susceptible to biases present in their training data, leading to unfair outcomes in various applications. While numerous strategies have been proposed to mitigate bias, they often require extensive computational resources and may compromise model performance. In this work, we introduce AXOLOTL, a novel post-processing framework, which operates agnostically across tasks and models, leveraging public APIs to interact with LLMs without direct access to internal parameters. Through a three-step process resembling zero-shot learning, AXOLOTL identifies biases, proposes resolutions, and guides the model to self-debias its outputs. This approach minimizes computational costs and preserves model performance, making AXOLOTL a promising tool for debiasing LLM outputs with broad applicability and ease of use.
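The three-step loop in the abstract (identify a bias, propose a resolution, let the model revise its own output) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not AXOLOTL's actual implementation. The `LLM` callable type, the toy keyword `lexicon` used for detection, the `BiasFinding` record, and the prompt wording are all hypothetical. The sketch only shows the control flow: the model is reached through a black-box text-in/text-out function, with no access to its parameters.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical interface: any function that sends a prompt to an LLM through
# a public API and returns the completion text. No model weights are touched.
LLM = Callable[[str], str]


@dataclass
class BiasFinding:
    """A flagged phrase and a suggested neutral alternative (hypothetical)."""
    span: str
    suggestion: str


def identify_bias(output: str, lexicon: dict[str, str]) -> Optional[BiasFinding]:
    """Step 1 (toy stand-in): flag the first lexicon phrase found in the output.

    `lexicon` is a hypothetical map from a biased phrase to a neutral one.
    Matching is case-insensitive; a real detector would be far more capable.
    """
    lowered = output.lower()
    for phrase, neutral in lexicon.items():
        if phrase in lowered:
            return BiasFinding(span=phrase, suggestion=neutral)
    return None


def propose_resolution(finding: BiasFinding) -> str:
    """Step 2: turn a finding into a natural-language hint (hypothetical wording)."""
    return (f'The phrase "{finding.span}" may reflect a biased assumption. '
            f'Consider using "{finding.suggestion}" instead.')


def self_debias(llm: LLM, task_prompt: str, lexicon: dict[str, str]) -> str:
    """Run the three steps using only black-box calls to `llm`."""
    output = llm(task_prompt)
    finding = identify_bias(output, lexicon)
    if finding is None:
        return output  # nothing flagged: keep the original answer unchanged
    # Step 3: zero-shot, no fine-tuning; ask the model to revise its own answer.
    revision_prompt = (f"{task_prompt}\n\nYour previous answer:\n{output}\n\n"
                       f"{propose_resolution(finding)} Rewrite the answer accordingly.")
    return llm(revision_prompt)


if __name__ == "__main__":
    # A fake model standing in for a real API client.
    def fake_llm(prompt: str) -> str:
        if "Rewrite" in prompt:
            return "The nurse finished their shift."
        return "The nurse finished her shift."

    print(self_debias(fake_llm, "Describe a nurse.", {"her shift": "their shift"}))
```

Because the model is only ever called through a string-to-string function, the same loop could in principle wrap any hosted model, which is the property the abstract emphasizes: extra cost is limited to one additional API call when a bias is flagged, and the model itself is never retrained.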