While chain-of-thought prompting (CoT) has the potential to improve the explainability of language model reasoning, it can systematically misrepresent the factors influencing models' behavior: for example, rationalizing answers in line with a user's opinion without mentioning this bias. To mitigate this biased reasoning problem, we introduce bias-augmented consistency training (BCT), an unsupervised fine-tuning scheme that trains models to give consistent reasoning across prompts with and without biasing features. We construct a suite testing nine forms of biased reasoning on seven question-answering tasks, and find that applying BCT to GPT-3.5-Turbo with one bias reduces the rate of biased reasoning by 86% on held-out tasks. Moreover, this model generalizes to other forms of bias, reducing biased reasoning on held-out biases by an average of 37%. Because BCT generalizes to held-out biases and does not require gold labels, this method may hold promise for reducing biased reasoning from as-yet-unknown biases and on tasks where supervision for ground-truth reasoning is unavailable.
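To make the training scheme concrete, the following is a minimal sketch of how a single BCT fine-tuning example might be constructed: the model's own chain of thought on an unbiased prompt becomes the training target for the corresponding biased prompt. All names here (`make_bct_example`, the "suggested answer" bias template) are illustrative assumptions, not the paper's exact implementation.

```python
def add_suggested_answer_bias(question: str, biased_option: str) -> str:
    """Inject one example biasing feature: a user-stated opinion
    favoring a particular answer option."""
    return (
        f"{question}\n"
        f"I think the answer is ({biased_option}), "
        f"but I'm curious what you think."
    )


def make_bct_example(question: str, unbiased_cot: str, biased_option: str) -> dict:
    """Pair a biased prompt with the model's own unbiased chain of thought.

    `unbiased_cot` would be sampled from the model itself on the plain
    question; no gold labels are needed, which is why BCT is unsupervised.
    """
    return {
        "prompt": add_suggested_answer_bias(question, biased_option),
        "target": unbiased_cot,  # fine-tuning target: the unbiased reasoning
    }


# Illustrative example: the bias points at the wrong answer, yet the
# training target is the reasoning the model gave without the bias.
example = make_bct_example(
    question="Q: Which planet is largest? (A) Mars (B) Jupiter",
    unbiased_cot="Jupiter is the largest planet. The answer is (B).",
    biased_option="A",
)
print(example["prompt"])
print(example["target"])
```

Fine-tuning on many such pairs, across questions and (in principle) across bias templates, trains the model to give the same reasoning whether or not the biasing feature is present.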