Large language models (LLMs) are increasingly used for high-stakes decisions, yet their susceptibility to spurious features remains poorly characterized. We introduce ICE-Guard, a framework that applies intervention consistency testing to detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Across 3,000 vignettes spanning 10 high-stakes domains, we evaluate 11 LLMs from 8 families and find that (1) authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%), challenging the field's narrow focus on demographics; (2) bias concentrates in specific domains: finance shows 22.6% authority bias, whereas criminal justice shows only 2.8%; and (3) structured decomposition, in which the LLM extracts features and a deterministic rubric makes the decision, reduces flip rates by up to 100% (median 49% across 9 models). We demonstrate an ICE-guided detect-diagnose-mitigate-verify loop that achieves a cumulative 78% bias reduction via iterative prompt patching. Validation against real COMPAS recidivism data shows that COMPAS-derived flip rates exceed pooled synthetic rates, suggesting that our benchmark provides a conservative estimate of real-world bias. Code and data are publicly available.
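To make the flip-rate metric and the structured-decomposition idea concrete, the following is a minimal sketch, not the released ICE-Guard code. It assumes a flip is counted whenever the decision on a vignette differs from the decision on its intervened counterpart. All names (`flip_rate`, `biased_decider`, `extract_features`, `rubric`) and the toy loan vignettes are hypothetical, and the two deciders are simple stand-ins for LLM calls.

```python
from typing import Any, Callable, Dict, List, Tuple

Vignette = Dict[str, Any]
Pair = Tuple[Vignette, Vignette]  # (original, intervened) vignettes


def flip_rate(decide: Callable[[Vignette], str], pairs: List[Pair]) -> float:
    """Fraction of pairs whose decision changes under the intervention."""
    if not pairs:
        return 0.0
    flips = sum(1 for original, swapped in pairs if decide(original) != decide(swapped))
    return flips / len(pairs)


def biased_decider(v: Vignette) -> str:
    """Hypothetical stand-in for an end-to-end LLM decision swayed by prestige."""
    score = v["income"] / 1000 + (20 if v["employer_prestige"] == "high" else 0)
    return "approve" if score >= 60 else "deny"


def extract_features(v: Vignette) -> Dict[str, Any]:
    """Hypothetical stand-in for LLM feature extraction: keep task-relevant fields only."""
    return {"income": v["income"]}


def rubric(features: Dict[str, Any]) -> str:
    """Deterministic rubric; the threshold is an illustrative assumption."""
    return "approve" if features["income"] >= 50_000 else "deny"


def decomposed_decider(v: Vignette) -> str:
    """Structured decomposition: extraction first, then the rubric decides."""
    return rubric(extract_features(v))


def make_pair(income: int) -> Pair:
    """Authority intervention: swap employer prestige, hold everything else fixed."""
    return (
        {"income": income, "employer_prestige": "low"},
        {"income": income, "employer_prestige": "high"},
    )


# Toy data for illustration only; not drawn from the benchmark.
pairs = [make_pair(income) for income in (45_000, 70_000, 30_000, 55_000)]

baseline = flip_rate(biased_decider, pairs)
mitigated = flip_rate(decomposed_decider, pairs)
print(f"end-to-end flip rate: {baseline:.2f}")
print(f"decomposed flip rate: {mitigated:.2f}")
```

In this toy setting the rubric never sees the swapped attribute, so its decisions cannot flip. With a real LLM extractor the spurious feature could still leak into the extracted values, which is one plausible reason the reported reductions vary across models.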