Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals, which serve only as an imperfect proxy. To address this, we introduce LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets), a framework for constructing datasets of structural counterfactual pairs. LIBERTy is grounded in explicitly defined Structural Causal Models (SCMs) of the text-generation process: interventions on a concept propagate through the SCM until an LLM generates the counterfactual. We introduce three datasets (disease detection, CV screening, and workplace-violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.