Explanations of model behavior are commonly evaluated via proxy properties that are only weakly tied to the purposes explanations serve in practice. We contribute a decision-theoretic framework that treats explanations as information signals, valued by the expected improvement they enable on a specified decision task. The framework yields three distinct estimands: (1) a theoretical benchmark that upper-bounds the performance achievable by any agent with access to the explanation; (2) a human-complementary value that quantifies the theoretically attainable value not already captured by a baseline human decision policy; and (3) a behavioral value that represents the causal effect of providing the explanation to human decision-makers. We instantiate these definitions in a practical validation workflow and apply them to assess explanation potential and interpret behavioral effects in human-AI decision support and mechanistic interpretability.
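To make the three estimands concrete, the following is a minimal sketch on a toy binary decision task. It is not the paper's implementation. Every name (`joint`, `payoff`, `best_expected_payoff`), the accuracy payoff, the conditional-independence assumption, and all numbers are hypothetical. The human-complementary value is computed here as the gain a Bayes-optimal agent gets from the explanation signal on top of the baseline human decision, which is one plausible reading of the definition above. The behavioral value is computed as a difference in mean payoff between two made-up randomized groups.

```python
from itertools import product

# Hypothetical joint distribution over (y, s, h); all numbers are illustrative.
P_Y1 = 0.5        # prior probability that the true state y is 1
SIGNAL_ACC = 0.8  # P(s == y): how often the explanation signal matches the state
HUMAN_ACC = 0.7   # P(h == y): accuracy of the baseline human decision


def joint(y, s, h):
    """P(y, s, h), assuming s and h are conditionally independent given y."""
    p_y = P_Y1 if y == 1 else 1 - P_Y1
    p_s = SIGNAL_ACC if s == y else 1 - SIGNAL_ACC
    p_h = HUMAN_ACC if h == y else 1 - HUMAN_ACC
    return p_y * p_s * p_h


def payoff(action, y):
    """Assumed decision payoff: 1 for a correct action, 0 otherwise."""
    return 1.0 if action == y else 0.0


def best_expected_payoff(observed):
    """Expected payoff of a Bayes-optimal agent that sees the variables named
    in `observed` (a tuple drawn from "s" and "h") before acting."""
    by_obs = {}
    for y, s, h in product((0, 1), repeat=3):
        values = {"s": s, "h": h}
        obs = tuple(values[name] for name in observed)
        scores = by_obs.setdefault(obs, [0.0, 0.0])
        for action in (0, 1):
            scores[action] += joint(y, s, h) * payoff(action, y)
    # For each possible observation, the agent picks the better action.
    return sum(max(scores) for scores in by_obs.values())


# (1) Theoretical benchmark: best payoff any agent can reach with the signal,
#     and its improvement over acting on the prior alone.
prior_only = best_expected_payoff(())
benchmark = best_expected_payoff(("s",))
theoretical_value = benchmark - prior_only

# (2) Human-complementary value (assumed form): what the signal adds beyond
#     the information already reflected in the baseline human decision.
complementary_value = best_expected_payoff(("s", "h")) - best_expected_payoff(("h",))

# (3) Behavioral value: difference in mean payoff between participants
#     randomized to see the explanation and those who were not (made-up data).
control = [1] * 14 + [0] * 6
treated = [1] * 15 + [0] * 5
behavioral_value = sum(treated) / len(treated) - sum(control) / len(control)

print(f"benchmark={benchmark:.2f} theoretical={theoretical_value:.2f} "
      f"complementary={complementary_value:.2f} behavioral={behavioral_value:.2f}")
```

In this toy setup the behavioral value falls below the human-complementary value, which falls below the theoretical value. Gaps of this kind are what such a framework could be used to interpret: how much of the attainable improvement humans actually realize when shown the explanation.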