Large Language Models (LLMs) exhibit systematic biases across demographic groups. Auditing has been proposed as an accountability tool for black-box LLM applications, but it relies on resource-intensive query access. We conceptualise auditing as uncertainty estimation over a target fairness metric and introduce BAFA, the Bounded Active Fairness Auditor, for query-efficient auditing of black-box LLMs. BAFA maintains a version space of surrogate models consistent with the scores queried so far and computes uncertainty intervals for fairness metrics (e.g., $\Delta$AUC) via constrained empirical risk minimisation. Active query selection narrows these intervals to reduce estimation error. We evaluate BAFA in two case studies on standard fairness datasets, \textsc{CivilComments} and \textsc{Bias-in-Bios}, comparing against stratified sampling, power sampling, and ablations. At tight error thresholds, BAFA reaches the target with up to 40$\times$ fewer queries than stratified sampling (e.g., 144 vs.\ 5,956 queries at $\varepsilon=0.02$ on \textsc{CivilComments}), performs substantially better throughout the audit, and shows lower variance across runs. These results suggest that active sampling can reduce the resources needed for independent fairness auditing of LLMs, supporting continuous model evaluation.
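To make the version-space construction concrete, the sketch below (Python with \texttt{numpy} only) is a minimal illustration under stated assumptions, not the paper's implementation: a toy one-feature logistic surrogate family plays the role of the surrogate model class, random search over candidates stands in for constrained empirical risk minimisation, and uniform random querying stands in for BAFA's active selection. All identifiers (\texttt{surrogate}, \texttt{delta\_auc}, \texttt{interval}, the tolerance \texttt{tol}) are hypothetical. The printed output shows how the uncertainty interval for $\Delta$AUC tightens as more black-box scores are queried.

\begin{verbatim}
# Illustrative sketch only (not the paper's implementation): a version
# space of toy logistic surrogates consistent with the black-box scores
# queried so far, and the induced uncertainty interval on Delta-AUC.
# Random search stands in for constrained ERM; random querying stands
# in for BAFA's active selection. All names here are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def auc(scores, labels):
    # Rank-based AUC: P(random positive outranks random negative).
    pos, neg = scores[labels == 1], scores[labels == 0]
    return (pos[:, None] > neg[None, :]).mean()

def delta_auc(scores, labels, groups):
    # Gap in AUC between demographic group 0 and group 1.
    a0 = auc(scores[groups == 0], labels[groups == 0])
    a1 = auc(scores[groups == 1], labels[groups == 1])
    return a0 - a1

def surrogate(theta, xs):
    # Toy one-feature logistic surrogate family, theta = (w, b).
    return 1.0 / (1.0 + np.exp(-(theta[0] * xs + theta[1])))

def interval(thetas, x, y, g, xq, sq, tol=0.01):
    # Min/max Delta-AUC over candidate surrogates whose squared error
    # on the queried items (xq, sq) is within tol (the version space).
    vals = [delta_auc(surrogate(th, x), y, g)
            for th in thetas
            if np.mean((surrogate(th, xq) - sq) ** 2) <= tol]
    return (min(vals), max(vals)) if vals else (None, None)

# Toy audit pool: one feature, ground-truth labels, group membership.
x = rng.normal(size=500)
g = rng.integers(0, 2, size=500)
y = (x + 0.3 * g + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Black-box scores, observable only by querying (hidden from auditor).
black_box = lambda xs: 1.0 / (1.0 + np.exp(-(1.2 * xs + 0.1)))

# Candidate surrogates; denser search approximates the constrained ERM.
thetas = rng.normal(size=(5000, 2)) * 2.0

# The interval on Delta-AUC tightens as more scores are queried.
for n_queries in (5, 20, 80):
    idx = rng.choice(len(x), size=n_queries, replace=False)
    lo, hi = interval(thetas, x, y, g, x[idx], black_box(x[idx]))
    if lo is not None:
        print(f"{n_queries:3d} queries -> Delta-AUC in "
              f"[{lo:.3f}, {hi:.3f}]")
\end{verbatim}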