AgentFairBench: Do LLM Agents Discriminate When They Act?

Large language model (LLM) agents increasingly take actions (screening applicants, recommending credit, triaging patients), yet LLM fairness is still measured by grading answers. We introduce AgentFairBench, a cheap, reproducible, multi-domain benchmark for demographic disparity in the actions of LLM agents. Grounded in a companion framework, the Bias Conduction Framework (BCF, restated here), it spans three regulator-anchored domains: hiring, lending, and medical triage. Synthetic, demographic-neutral profiles are evaluated in counterfactual matched sets that vary only a name-coded race × gender signal (in the Bertrand-Mullainathan tradition), under four agent scaffolds of increasing agency (direct, chain-of-thought, multi-agent deliberation, tool-augmented). A NumPy-only harness computes counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity, with bootstrap confidence intervals, paired tests, and false-discovery-rate control, for single-digit dollars per model. A live leaderboard with a held-out private split and a contamination canary admits external models by submission. Our pilot (864 decisions plus a test-retest replication) carries a methodological lesson: comparing a six-group score spread against a two-run noise difference overstates disparity by ~2.4× through statistic arity alone. Against an arity-matched noise floor and an omnibus group test, Claude Haiku 4.5 shows no demographic effect above sampling noise (0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction); a planted-bias test confirms that the instrument detects disparity when it is present. The contributions are a sound, sensitive, adoption-ready instrument, the arity-matched null methodology, and open artifacts to scale it. Code, data, and harness are released under open licenses, with an anonymized review artifact.
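The arity effect described above can be illustrated with a small simulation. The sketch below is not the released harness: it assumes an idealized null in which every group mean and every run mean is an independent standard-normal draw, and the function name `spread_ratio` and its parameters are hypothetical. Under that assumption, the expected max-minus-min spread over six noise-only values is roughly 2.2 times the expected gap between two noise-only values, so a six-group spread looks inflated against a two-run floor even when no group effect exists. The pilot's ~2.4× figure comes from its own data and need not match this idealized model.

```python
import numpy as np


def spread_ratio(n_stat=6, n_floor=2, n_trials=100_000, seed=0):
    """Ratio of expected ranges under a pure-noise null (illustrative only).

    n_stat  : number of values in the disparity statistic (e.g. six groups)
    n_floor : number of values in the noise floor (e.g. two repeated runs)
    Assumption: all values are i.i.d. standard normal, i.e. no true effect.
    """
    rng = np.random.default_rng(seed)
    stat_draws = rng.standard_normal((n_trials, n_stat))
    floor_draws = rng.standard_normal((n_trials, n_floor))
    # Max-minus-min spread per simulated trial.
    stat_spread = stat_draws.max(axis=1) - stat_draws.min(axis=1)
    floor_spread = floor_draws.max(axis=1) - floor_draws.min(axis=1)
    return stat_spread.mean() / floor_spread.mean()


# Mismatched arity: six-value spread against a two-value noise floor.
mismatched = spread_ratio(n_stat=6, n_floor=2)
# Matched arity: six-value spread against a six-value noise floor.
matched = spread_ratio(n_stat=6, n_floor=6)
print(f"mismatched ratio: {mismatched:.2f}, matched ratio: {matched:.2f}")
```

With matched arity the ratio sits near 1, which is the sense in which an arity-matched noise floor removes the spurious inflation.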
