We propose an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of lower bounds on the mutual information between selected features and the classification decision. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness. Compared to existing interactive setups, we rely neither on optimal agents nor on the assumption that features are distributed independently. Instead, we use the relative strength of the agents as well as the new concept of Asymmetric Feature Correlation, which captures the precise kind of correlations that make interpretability guarantees difficult. We evaluate our results on two small-scale datasets where high mutual information can be verified explicitly.
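To illustrate what verifying mutual information explicitly can look like on a small dataset, the following is a minimal sketch that estimates the mutual information between a binary feature indicator and a class label from samples. It is not the procedure used in this work: the function name `mutual_information` and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information I(X; Y) in bits from (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)                 # counts of each (x, y) pair
    px = Counter(x for x, _ in pairs)      # marginal counts of x
    py = Counter(y for _, y in pairs)      # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data: x = 1 if the selected feature is present, y = class.
# Here the feature determines the class exactly.
informative = [(1, "a")] * 50 + [(0, "b")] * 50
# Here the feature is statistically independent of the class.
uninformative = (
    [(1, "a")] * 25 + [(1, "b")] * 25 + [(0, "a")] * 25 + [(0, "b")] * 25
)

mi_informative = mutual_information(informative)
mi_uninformative = mutual_information(uninformative)
print(mi_informative, mi_uninformative)
```

In this toy case the informative feature carries one full bit about a balanced binary class, while the independent feature carries none. The guarantees described above are lower bounds on this quantity, so an explicit computation of this kind is only feasible when the data distribution is small enough to enumerate.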