Information retrieval models have undergone a paradigm shift from unsupervised statistical approaches, to feature-based supervised approaches, to fully data-driven ones that rely on pre-trained large language models. While the increasing complexity of search models has yielded improvements in effectiveness (measured in terms of the relevance of the top-retrieved results), a question that merits thorough inspection is how explainable these models are, which is what this paper aims to evaluate. In particular, we propose a common evaluation platform to systematically evaluate the explainability of any ranking model, with the explanation algorithm held identical across all the models being evaluated. In our proposed framework, each model, in addition to returning a ranked list of documents, is also required to return a list of explanation units, or rationales, for each document. This meta-information for each document is then used to measure how locally consistent these rationales are, as an intrinsic measure of interpretability that does not require manual relevance assessments. Additionally, as an extrinsic measure, we compute how relevant these rationales are by leveraging sub-document level relevance assessments. Our findings include a number of interesting observations: sentence-level rationales are more consistent, an increase in model complexity mostly leads to less consistent explanations, and interpretability measures offer a complementary dimension for evaluating IR systems, because consistency is not well correlated with nDCG at top ranks.
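The abstract does not give formulas for its measures, so the following is only a minimal sketch of two of the quantities it mentions: nDCG at a rank cutoff, and an extrinsic rationale-relevance score. It assumes that rationales and sub-document relevance assessments are both available as identifiers of text units (for example, sentence IDs), and that rationale relevance can be approximated as the fraction of returned rationales judged relevant. The function names `ndcg_at_k` and `rationale_precision` are hypothetical and are not taken from the paper; the intrinsic consistency measure is not sketched because the abstract does not specify it.

```python
import math
from typing import List, Set


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k graded relevance values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(gains: List[float], k: int) -> float:
    """nDCG@k for one ranked list.

    Simplifying assumption: the ideal ranking is built from the gains of the
    retrieved documents only, not from all judged documents for the query.
    """
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def rationale_precision(rationales: List[str], relevant_units: Set[str]) -> float:
    """Fraction of a document's rationale units that are judged relevant.

    Assumed stand-in for an extrinsic measure; the paper's actual definition
    may differ.
    """
    if not rationales:
        return 0.0
    hits = sum(1 for unit in rationales if unit in relevant_units)
    return hits / len(rationales)


if __name__ == "__main__":
    # Hypothetical example: graded relevance of the top-3 retrieved documents.
    print(ndcg_at_k([0, 1, 2], k=3))
    # Hypothetical example: four sentence-level rationales, two judged relevant.
    print(rationale_precision(["s1", "s2", "s3", "s4"], {"s1", "s3"}))
```

Under these assumptions, a system could score well on `ndcg_at_k` while scoring poorly on `rationale_precision`, which is the kind of complementary signal the abstract describes.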