Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these *unverbalized biases*. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model's CoTs. We evaluate our pipeline across six LLMs on three decision tasks (hiring, loan approval, and university admissions). Our technique automatically discovers previously unknown biases in these models (e.g., Spanish fluency, English proficiency, writing formality). In the same run, the pipeline also validates biases that were manually identified by prior work (gender, race, religion, ethnicity). More broadly, our proposed approach provides a practical, scalable path to automatic task-specific bias discovery.
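To make the testing loop concrete, below is a minimal sketch of the per-concept procedure described above, under stated assumptions: the helper names (`score_model`, `make_variants`, `cites_concept_in_cot`) are hypothetical placeholders rather than the paper's actual implementation, and a simple Bonferroni-style split of the significance level stands in for whatever multiple-testing and early-stopping scheme the pipeline uses.

```python
from scipy import stats

def test_concept(concept, task_inputs, score_model, make_variants,
                 cites_concept_in_cot, sample_sizes=(50, 200, 800),
                 n_concepts=20, alpha=0.05):
    """Test one candidate bias concept on progressively larger samples.

    Returns True if the concept looks like an unverbalized bias: a
    statistically significant score gap between positive and negative
    variants that the model does not cite in its chain of thought.
    Helper functions here are illustrative placeholders.
    """
    # Assumed correction: split alpha across concepts and interim looks
    # (a stand-in for the paper's multiple-testing / early-stopping scheme).
    alpha_per_test = alpha / (n_concepts * len(sample_sizes))

    for n in sample_sizes:
        batch = task_inputs[:n]
        pos_scores, neg_scores, cot_mentions = [], [], []
        for x in batch:
            x_pos, x_neg = make_variants(x, concept)   # inject / remove the concept
            s_pos, cot_pos = score_model(x_pos)        # decision score + CoT text
            s_neg, cot_neg = score_model(x_neg)
            pos_scores.append(s_pos)
            neg_scores.append(s_neg)
            cot_mentions.append(cites_concept_in_cot(cot_pos, concept) or
                                cites_concept_in_cot(cot_neg, concept))

        # Paired test on per-input score differences between variants.
        _, p_value = stats.ttest_rel(pos_scores, neg_scores)

        if p_value < alpha_per_test:
            # Significant gap: flag only if the model rarely verbalizes the concept.
            verbalized = sum(cot_mentions) / len(cot_mentions) > 0.5
            return not verbalized
    return False  # never reached significance at the largest sample: not flagged
```

In practice one would run this loop over every autorater-generated candidate concept; the early exit at the first significant sample size is what keeps the number of model queries manageable.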