Ensuring the trustworthiness and interpretability of machine learning models is critical to their deployment in real-world applications. Feature attribution methods, which provide local explanations of model predictions by attributing importance to individual input features, have gained significant attention. This study examines how well feature attributions generalize across deep learning architectures, such as convolutional neural networks (CNNs) and vision transformers. We assess the feasibility of using a feature attribution method as a feature detector and examine how the resulting features can be harmonized across multiple models that employ distinct architectures but are trained on the same data distribution. By exploring this harmonization, we aim to develop a more coherent and optimistic understanding of feature attributions and to enhance the consistency of local explanations across diverse deep learning models. Our findings highlight the potential for harmonized feature attribution methods to improve interpretability and foster trust in machine learning applications, regardless of the underlying architecture.