Finding important features that contribute to the prediction of neural models is an active area of research in explainable AI. Neural models are opaque, and finding such features helps explain their predictions. In this work, we take the inverse perspective and study distractor features: features that cast doubt on a prediction by affecting the model's confidence in it. Understanding distractors provides a complementary view of feature relevance in the predictions of neural models. We apply a reduction-based technique to find distractors and report preliminary results on their impacts and types. Our experiments across various tasks, models, and datasets of code reveal that removing tokens can have a significant impact on a model's confidence in its predictions, and that the categories of tokens can also play a vital role in that confidence. Our study aims to enhance the transparency of models by emphasizing the tokens that significantly influence their confidence.
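The abstract does not spell out the reduction procedure, so the following is only a minimal sketch of one possible reading: remove one token at a time and record how the model's confidence changes. It assumes a distractor is a token whose removal raises confidence; the names `find_distractors`, `toy_confidence`, and `min_gain` are hypothetical, and `toy_confidence` is a stand-in for a real model, not the setup used in the paper.

```python
from typing import Callable, List, Sequence, Tuple


def find_distractors(
    tokens: Sequence[str],
    confidence: Callable[[Sequence[str]], float],
    min_gain: float = 0.0,
) -> List[Tuple[int, str, float]]:
    """Return (index, token, gain) for tokens whose removal raises confidence.

    Assumption: `confidence` returns the model's confidence in its original
    prediction for the given token sequence.
    """
    base = confidence(tokens)
    found = []
    for i, tok in enumerate(tokens):
        # Reduce the input by dropping exactly one token.
        reduced = list(tokens[:i]) + list(tokens[i + 1:])
        gain = confidence(reduced) - base
        if gain > min_gain:
            found.append((i, tok, gain))
    # Largest confidence gain first.
    found.sort(key=lambda item: item[2], reverse=True)
    return found


def toy_confidence(tokens: Sequence[str]) -> float:
    """Hypothetical stand-in for a model: each "tmp" token lowers confidence."""
    score = 0.9 - 0.2 * sum(1 for t in tokens if t == "tmp")
    return max(0.0, min(1.0, score))


if __name__ == "__main__":
    for index, token, gain in find_distractors(["int", "tmp", "=", "x"], toy_confidence):
        print(index, token, round(gain, 3))
```

A real study would replace `toy_confidence` with the probability the model assigns to its originally predicted label, and could group the reported tokens by category (for example identifiers, keywords, or operators) to compare their effects.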