Large Language Models (LLMs) often produce chain-of-thought (CoT) reasoning traces that appear plausible but may hide internal biases, which we call unverbalized biases. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple-testing correction and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model's CoTs. We evaluate our pipeline across seven LLMs on three decision tasks (hiring, loan approval, and university admissions). Our technique automatically discovers previously unknown biases in these models (e.g., Spanish fluency, English proficiency, writing formality). In the same run, the pipeline also validates biases that were manually identified by prior work (gender, race, religion, ethnicity). More broadly, our approach provides a practical, scalable path to automated, more efficient, and broader discovery of task-specific unverbalized biases.
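To make the testing step concrete, the following is a minimal sketch of how a single candidate concept might be screened. The abstract does not say which statistical tests the pipeline uses, so everything here is an assumption for illustration: an exact McNemar test on paired accept/reject decisions, a Bonferroni split of the significance level across concepts and stages, and the hypothetical names `mcnemar_exact_p`, `screen_concept`, `stages`, and `max_cite_rate`.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def screen_concept(pairs, n_concepts, alpha=0.05,
                   stages=(20, 50, 100), max_cite_rate=0.1):
    """Screen one candidate concept on progressively larger samples.

    pairs: list of (accepted_pos, accepted_neg, cited) tuples, one per input,
           where `cited` says whether either CoT named the concept as a reason.
    Assumption: alpha is split Bonferroni-style over concepts and stages,
    which is conservative and is not necessarily the paper's procedure.
    """
    threshold = alpha / (n_concepts * len(stages))
    n_used = 0
    for n in stages:
        sample = pairs[:n]
        n_used = len(sample)
        # Discordant pairs: the decision flipped between the two variations.
        b = sum(1 for pos, neg, _ in sample if pos and not neg)
        c = sum(1 for pos, neg, _ in sample if neg and not pos)
        if mcnemar_exact_p(b, c) < threshold:
            # Early stop: significant effect, now check whether it is verbalized.
            cite_rate = sum(1 for _, _, cited in sample if cited) / n_used
            if cite_rate <= max_cite_rate:
                return "unverbalized_bias", n_used
            return "verbalized_bias", n_used
        if n_used < n:
            break  # ran out of data before reaching the next stage size
    return "no_effect_detected", n_used
```

Under these assumptions, a concept whose positive variation is always accepted and whose negative variation is always rejected, with no CoT citing it, is flagged at the first stage without consuming the larger samples. In this sketch, stopping early only on significance keeps the example short; a fuller version could also stop early for futility.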