Neural models have demonstrated remarkable performance across diverse ranking tasks. However, the processes and internal mechanisms by which they determine relevance remain largely unknown. Existing approaches for analyzing neural ranker behavior with respect to IR properties either assess overall model behavior or employ probing methods that may offer an incomplete understanding of the underlying causal mechanisms. To provide a more granular understanding of internal model decision-making, we propose the use of causal interventions to reverse engineer neural rankers, and demonstrate how mechanistic interpretability methods can isolate components satisfying term-frequency axioms within a ranking model. We identify a group of attention heads that detect duplicate tokens in the earlier layers of the model and then communicate with downstream heads to compute overall document relevance. More generally, we propose that this style of mechanistic analysis opens up avenues for reverse engineering the processes neural retrieval models use to compute relevance. This work aims to initiate granular interpretability efforts that will not only benefit retrieval model development and training but ultimately ensure the safer deployment of these models.
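Since the central technique described above is a causal intervention on individual attention heads (activation patching), a minimal sketch may help fix ideas. Everything below is illustrative rather than the paper's actual setup: `cross-encoder/ms-marco-MiniLM-L-6-v2` is simply a convenient public BERT-style reranker, the query/document strings are toy examples, and the layer/head indices are arbitrary placeholders.

```python
# Minimal activation-patching sketch for a BERT-style cross-encoder.
# Assumptions (not from the paper): the checkpoint below, the toy strings,
# and the (layer, head) pair under test.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # any BERT-style reranker with `.bert` works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

query = "cheap flights"
doc_clean = "cheap flights and cheap hotels"    # contains a duplicated query term
doc_corrupt = "cheap flights and small hotels"  # duplicate swapped for an unrelated
                                                # token; both docs must tokenize to
                                                # the same length for patching

enc = lambda d: tok(query, d, return_tensors="pt")

layer, head = 2, 5  # hypothetical head under test
head_dim = model.config.hidden_size // model.config.num_attention_heads
attn = model.bert.encoder.layer[layer].attention.self

cache = {}

def save_hook(module, inputs, output):
    # Cache the concatenated per-head outputs from the clean run: [batch, seq, hidden]
    cache["z"] = output[0].detach()

def patch_hook(module, inputs, output):
    # Overwrite one head's slice of the corrupted run with its clean activation
    z = output[0].clone()
    s = slice(head * head_dim, (head + 1) * head_dim)
    z[..., s] = cache["z"][..., s]
    return (z,) + output[1:]

with torch.no_grad():
    h = attn.register_forward_hook(save_hook)
    clean_score = model(**enc(doc_clean)).logits.item()
    h.remove()

    corrupt_score = model(**enc(doc_corrupt)).logits.item()

    h = attn.register_forward_hook(patch_hook)
    patched_score = model(**enc(doc_corrupt)).logits.item()
    h.remove()

print(f"clean={clean_score:.3f} corrupt={corrupt_score:.3f} patched={patched_score:.3f}")
# If patching this single head recovers much of the clean relevance score,
# the head is causally implicated in the duplicate-token (term-frequency) behavior.
```

The design choice worth noting is the clean/corrupted pairing: because the corrupted document differs from the clean one only in the duplicated term, any score recovered by patching a single head's activation can be attributed to that head's handling of the duplicate, which is what lets the intervention localize term-frequency behavior to specific components.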