Artificial Intelligence (AI) systems are increasingly used in high-stakes domains of our lives, which heightens the need to explain their decisions and to ensure that those decisions are made the way we want them to be made. The field of Explainable AI (XAI) has emerged in response. However, it faces a significant challenge known as the disagreement problem: multiple explanations are possible for the same AI decision or prediction. While the existence of the disagreement problem is acknowledged, its potential implications have not yet been widely studied. First, we provide an overview of the different strategies explanation providers could deploy to adapt the returned explanation to their benefit. We distinguish between strategies that attack the machine learning model or underlying data to influence the explanations, and strategies that leverage the explanation phase directly. Next, we analyze several objectives and concrete scenarios that could motivate providers to engage in this behavior, and the potentially dangerous consequences this manipulative behavior could have on society. We emphasize that it is crucial to investigate this issue now, before these methods are widely implemented, and we propose some mitigation strategies.