Large language models (LLMs) that execute tasks through instruction-based prompts often struggle when the distribution of user instructions differs from that of the instructions seen during training. This mismatch leads to distractions and biases, especially when the labels are inconsistent and dynamic. In this paper, we introduce CRISPR, a novel bias mitigation method designed to alleviate instruction-label biases in LLMs. CRISPR uses attribution methods to identify the bias neurons that drive biased outputs and then prunes those neurons. Experimental results demonstrate that the method mitigates biases in instruction-based prompting and improves language model performance on social bias benchmarks without compromising pre-existing knowledge. CRISPR is highly practical and model-agnostic, offering flexibility in adapting to evolving social biases.
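The identify-then-prune idea can be illustrated with a toy sketch. This is not the paper's implementation: the scoring rule (activation times the difference between the readout weights of the biased and the desired label), the single linear readout, and all function names and numbers below are assumptions chosen only to make the two steps concrete.

```python
from typing import List, Sequence


def attribution_scores(
    activations: Sequence[Sequence[float]],
    w_biased: Sequence[float],
    w_gold: Sequence[float],
) -> List[float]:
    """Per-neuron score, averaged over examples.

    Assumed scoring rule (illustrative only): activation * (w_biased - w_gold),
    i.e. how much a neuron pushes the output toward the biased label
    rather than the desired one under a linear readout.
    """
    n = len(w_biased)
    totals = [0.0] * n
    for h in activations:
        for j in range(n):
            totals[j] += h[j] * (w_biased[j] - w_gold[j])
    return [t / len(activations) for t in totals]


def select_bias_neurons(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k neurons with the highest scores."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def prune(h: Sequence[float], neurons: Sequence[int]) -> List[float]:
    """Zero the activations of the selected neurons."""
    dropped = set(neurons)
    return [0.0 if j in dropped else a for j, a in enumerate(h)]


def logit(h: Sequence[float], w: Sequence[float]) -> float:
    """Linear readout for one label."""
    return sum(a * b for a, b in zip(h, w))


# Hypothetical hidden activations for two prompts and readout weights for two labels.
activations = [[1.0, 0.5, 2.0], [0.8, 0.1, 1.5]]
w_biased = [0.2, -0.1, 1.0]
w_gold = [0.3, 0.4, 0.1]

scores = attribution_scores(activations, w_biased, w_gold)
bias_neurons = select_bias_neurons(scores, k=1)
pruned = prune(activations[0], bias_neurons)
```

In this toy setup the third neuron carries most of the score, so it is the one selected; after pruning, the first example's readout favors the desired label over the biased one. In a real model the scores would come from an attribution method applied to the network's own activations and gradients rather than from a hand-written linear readout.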