Large language models (LLMs) can exhibit social bias in their generations, especially when performing inference on toxic prompts. Controlling sensitive attributes during generation faces challenges in data distribution, generalizability, and efficiency. Specifically, fine-tuning and retrieval demand extensive unbiased corpora, while direct prompting requires meticulously curated instructions that correct the output over multiple rounds of thought, which imposes costs in memory and inference latency. In this work, we propose Expert-Guided Extinction of Toxic Tokens for Debiased Generation (EXPOSED), which eliminates undesired harmful outputs from LLMs without these requirements. EXPOSED constructs a debiasing expert from abundant toxic corpora to expose and elicit potentially harmful tokens. It then processes the LLM's output distribution, suppressing and attenuating the toxic tokens to construct a fair distribution. EXPOSED is evaluated on fairness benchmarks over three LLM families. Extensive experiments demonstrate that, compared with other baselines, EXPOSED significantly reduces potential social bias while balancing fairness and generation performance.
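The abstract only sketches the decoding intervention, so the following is a minimal illustrative sketch rather than the paper's actual algorithm: it assumes an EXPOSED-style step combines the target LLM's next-token distribution with a toxic expert's distribution, down-weighting tokens the expert rates as likely. The function name, the suppression rule, and the hyperparameters `alpha` and `tau` are all assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def exposed_style_decode_step(base_logits: torch.Tensor,
                              expert_logits: torch.Tensor,
                              alpha: float = 1.0,
                              tau: float = 0.01) -> torch.Tensor:
    """One hypothetical decoding step of expert-guided toxic-token suppression.

    base_logits:   next-token logits from the target LLM, shape (vocab,).
    expert_logits: next-token logits from a debiasing expert trained on toxic
                   text, shape (vocab,); high probability here marks tokens
                   the expert "exposes" as potentially harmful.
    alpha:         attenuation strength (assumed hyperparameter).
    tau:           probability threshold above which a token counts as toxic
                   (assumed hyperparameter).
    """
    base_probs = F.softmax(base_logits, dim=-1)
    expert_probs = F.softmax(expert_logits, dim=-1)

    # Suppress: exponentially attenuate tokens the toxic expert rates as likely.
    toxic_mask = expert_probs > tau
    adjusted = base_probs.clone()
    adjusted[toxic_mask] *= torch.exp(-alpha * expert_probs[toxic_mask])

    # Renormalize into a fair distribution over the vocabulary.
    return adjusted / adjusted.sum()

# Toy usage: random logits stand in for two model forward passes.
vocab_size = 32000
fair_probs = exposed_style_decode_step(torch.randn(vocab_size),
                                       torch.randn(vocab_size))
print(fair_probs.sum())  # ~1.0; sample the next token from fair_probs
```

Keeping the intervention at the distribution level, as sketched here, is what lets this family of methods avoid fine-tuning the base model or curating unbiased corpora: only an extra forward pass through the expert is needed per step.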