Large language models (LLMs) are increasingly deployed as autonomous decision agents in settings with asymmetric error costs: hiring (missed talent vs. wasted interviews), medical triage (missed emergencies vs. unnecessary escalation), and fraud detection (approved fraud vs. declined legitimate payments). The dominant design queries a single LLM for a posterior over states, thresholds its "confidence," and acts; we prove this design is inadequate for sequential decisions with asymmetric costs. We propose a Bayesian, cost-aware multi-LLM orchestration framework that treats LLMs as approximate likelihood models rather than classifiers. For each candidate state, we elicit likelihoods via contrastive prompting, aggregate them across diverse models with robust statistics, and update beliefs with Bayes' rule under explicit priors as new evidence arrives. This enables coherent belief updating, expected-cost action selection, principled information gathering via value of information, and fairness gains via ensemble bias mitigation. In resume screening with costs of USD 40,000 per missed hire, USD 2,500 per interview, and USD 150 per phone screen, experiments on 1,000 resumes using five LLMs (GPT-4o, Claude 4.5 Sonnet, Gemini Pro, Grok, DeepSeek) show that our framework reduces total cost by USD 294,000 (34 percent) versus the best single-LLM baseline and improves demographic parity by 45 percent (maximum group gap reduced from 22 to 5 percentage points). Ablations attribute 51 percent of the savings to multi-LLM aggregation, 43 percent to sequential updating, and 20 percent to disagreement-triggered information gathering, consistent with the theoretical benefits of correct probabilistic foundations.
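The decision loop the abstract describes — elicit per-state likelihoods from several LLMs, aggregate them with a robust statistic, update beliefs with Bayes' rule, and pick the action with minimum expected cost — can be sketched as below. This is a minimal illustration, not the paper's implementation: the two-state space, the choice of the median as the robust aggregator, and the per-(action, state) cost table are assumptions, with the dollar figures taken from the abstract's resume-screening setup.

```python
import statistics

def aggregate_likelihoods(per_model):
    """Robustly aggregate P(evidence | state) across models via the median.

    per_model: list of dicts, one per LLM, mapping state -> likelihood.
    (Median aggregation is an illustrative choice of "robust statistics".)
    """
    states = per_model[0].keys()
    return {s: statistics.median(m[s] for m in per_model) for s in states}

def bayes_update(prior, likelihood):
    """Posterior over states via Bayes' rule with an explicit prior."""
    unnorm = {s: prior[s] * likelihood[s] for s in prior}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

def expected_cost(action, belief, cost):
    """Expected cost of an action under the current belief."""
    return sum(belief[s] * cost[(action, s)] for s in belief)

def decide(belief, cost, actions=("interview", "reject")):
    """Choose the terminal action with minimum expected cost."""
    return min(actions, key=lambda a: expected_cost(a, belief, cost))

# Illustrative cost table using the abstract's figures: USD 40,000 per
# missed hire, USD 2,500 per interview (the "hire"/"no-hire" states here
# are hypothetical labels, not the paper's exact state space).
COST = {
    ("interview", "good"): 2500,   # interview cost, candidate is good
    ("interview", "bad"): 2500,    # wasted interview
    ("reject", "good"): 40000,     # missed hire
    ("reject", "bad"): 0,          # correct rejection
}

# One step of the loop: three models' elicited likelihoods for one resume.
models = [{"good": 0.8, "bad": 0.3},
          {"good": 0.7, "bad": 0.2},
          {"good": 0.9, "bad": 0.4}]
lik = aggregate_likelihoods(models)          # {"good": 0.8, "bad": 0.3}
post = bayes_update({"good": 0.3, "bad": 0.7}, lik)
action = decide(post, COST)
```

On this example the posterior probability of "good" is 0.24 / (0.24 + 0.21) ≈ 0.53, so the USD 2,500 interview beats the ≈ USD 21,300 expected cost of rejecting; a disagreement-triggered USD 150 phone screen would slot in as a third, information-gathering action whenever its value of information exceeds its cost.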