Pre-trained large language models (LLMs) can now be easily adapted for specific business purposes using custom prompts or fine-tuning. These customizations are often iteratively re-engineered to improve some aspect of performance, but after each change businesses want to ensure that there has been no negative impact on the system's behavior around critical issues such as bias. Prior methods of benchmarking bias use techniques such as word masking and multiple-choice questions to assess bias at scale, but these do not capture all of the nuanced types of bias that can occur in free-response answers, the type of output typically generated by LLM systems. In this paper, we identify several kinds of nuanced bias in free text that cannot be identified by multiple-choice tests. We describe these as confidence bias, implied bias, inclusion bias, and erasure bias. We present a semi-automated pipeline for detecting these types of bias: answers that can be automatically classified as unbiased are first eliminated, and crowd workers then co-evaluate name-reversed pairs of the remaining answers. We believe that the nuanced classifications our method generates can be used to give better feedback to LLMs, especially as LLM reasoning capabilities become more advanced.