Interpretable representations are the backbone of many explainers that target black-box predictive systems based on artificial intelligence and machine learning algorithms. They translate the low-level data representation necessary for good predictive performance into high-level human-intelligible concepts used to convey the explanatory insights. Notably, the explanation type and its cognitive complexity are directly controlled by the interpretable representation, which can be tweaked to target a particular audience and use case. However, many explainers built upon interpretable representations overlook their merit and fall back on default solutions that often carry implicit assumptions, thereby degrading the explanatory power and reliability of such techniques. To address this problem, we study properties of interpretable representations that encode the presence and absence of human-comprehensible concepts. We demonstrate how they are operationalised for tabular, image and text data; discuss their assumptions, strengths and weaknesses; identify their core building blocks; and scrutinise their configuration and parameterisation. In particular, this in-depth analysis allows us to pinpoint their explanatory properties, desiderata and scope for (malicious) manipulation in the context of tabular data, where a linear model is used to quantify the influence of interpretable concepts on a black-box prediction. Our findings lead to a range of recommendations for designing trustworthy interpretable representations; specifically, we show the benefits of class-aware (supervised) discretisation of tabular data, e.g., with decision trees, and the sensitivity of image interpretable representations to segmentation granularity and occlusion colour.
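To make the recommendation on class-aware discretisation concrete, the following is a minimal sketch of how a supervised discretisation of a single numerical feature could be derived from a decision tree, assuming scikit-learn is available; the function name `supervised_bins` and the toy data are illustrative, not part of the original work.

```python
# Class-aware (supervised) discretisation sketch: a shallow decision tree is
# fit on one feature against the class labels, and its learnt split
# thresholds become the bin edges of the interpretable representation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def supervised_bins(feature, labels, max_bins=4):
    """Return class-aware bin edges for a single numerical feature."""
    tree = DecisionTreeClassifier(max_leaf_nodes=max_bins, random_state=0)
    tree.fit(feature.reshape(-1, 1), labels)
    # Internal nodes carry the learnt thresholds; leaf nodes are marked
    # with the sentinel value -2 in scikit-learn's tree structure.
    thresholds = tree.tree_.threshold[tree.tree_.threshold != -2]
    return np.sort(thresholds)

# Toy data: the class flips around 0, which the tree should recover as
# the (single) discretisation threshold.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = (x > 0).astype(int)
edges = supervised_bins(x, y, max_bins=2)
print(edges)
```

In contrast to unsupervised alternatives such as equal-width or quantile binning, the bin edges here are placed where the class distribution actually changes, which is what makes the resulting interpretable concepts meaningful with respect to the modelled task.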