Quantitative risk assessment in high-stakes domains relies on structured expert elicitation to estimate unobservable properties. The gold standard, the Delphi method, produces calibrated, auditable judgments but requires months of coordination and specialist time, placing rigorous risk assessment out of reach for most applications. We investigate whether Large Language Models (LLMs) can serve as scalable proxies for structured expert elicitation. We propose Scalable Delphi, which adapts the classical protocol for LLMs with diverse expert personas, iterative refinement, and rationale sharing. Because target quantities are typically unobservable, we develop an evaluation framework based on necessary conditions: calibration against verifiable proxies, sensitivity to evidence, and alignment with human expert judgment. We evaluate this framework in the domain of AI-augmented cybersecurity risk, using three capability benchmarks and independent human elicitation studies. LLM panels achieve strong correlations with benchmark ground truth (Pearson r=0.87-0.95), improve systematically as evidence is added, and align with human expert panels; in one comparison, an LLM panel is closer to a human panel than the two human panels are to each other. These results demonstrate that LLM-based elicitation can extend structured expert judgment to settings where traditional methods are infeasible, reducing elicitation time from months to minutes.
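The elicitation loop described above (persona-diverse first-round estimates, followed by rounds in which each panelist sees a shared consensus signal and revises) can be sketched as follows. This is a toy illustration, not the paper's implementation: `query_persona` is a hypothetical deterministic stub standing in for a real LLM call, and the halfway-toward-consensus update and median aggregation are illustrative assumptions.

```python
import statistics

def query_persona(persona, question, peer_summary=None):
    # Hypothetical stand-in for an LLM call. A real system would prompt a model
    # with the persona description, the question, and (after round 1) the
    # panel's shared rationales and summary statistics.
    prior = persona["prior"]
    if peer_summary is None:
        return prior  # round 1: independent estimate
    # Toy revision rule: move halfway toward the shared consensus signal.
    return (prior + peer_summary) / 2

def scalable_delphi(personas, question, rounds=3):
    """Iterative Delphi-style elicitation over a panel of LLM personas."""
    estimates = [query_persona(p, question) for p in personas]
    for _ in range(rounds - 1):
        consensus = statistics.median(estimates)  # feedback shared with the panel
        estimates = [query_persona(p, question, consensus) for p in personas]
    return statistics.median(estimates)  # final aggregated panel judgment

personas = [{"role": "red-teamer", "prior": 0.2},
            {"role": "SOC analyst", "prior": 0.5},
            {"role": "CISO", "prior": 0.8}]
print(scalable_delphi(personas, "P(exploit developed within 12 months)?"))  # 0.5
```

In this toy run the panel median is stable across rounds, so the final judgment is 0.5; with a real LLM backend, revisions would instead be driven by the rationales each persona reads between rounds.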