Ensuring the trustworthiness and interpretability of machine learning models is critical to their deployment in real-world applications. Feature attribution methods, which provide local explanations of model predictions by attributing importance to individual input features, have gained significant attention. This study examines how well feature attributions generalize across deep learning architectures, such as convolutional neural networks (CNNs) and vision transformers. We assess the feasibility of using a feature attribution method as a feature detector and examine how the resulting features can be harmonized across multiple models that employ distinct architectures but are trained on the same data distribution. By exploring this harmonization, we aim to develop a more coherent and optimistic understanding of feature attributions and to enhance the consistency of local explanations across diverse deep learning models. Our findings highlight the potential of harmonized feature attribution methods to improve interpretability and foster trust in machine learning applications, regardless of the underlying architecture.