Evaluating the Explainability of Neural Rankers

Information retrieval models have undergone a paradigm shift from unsupervised statistical approaches, to feature-based supervised approaches, to completely data-driven ones that rely on pre-trained large language models. While the increasing complexity of search models has yielded improvements in effectiveness (measured in terms of the relevance of the top-retrieved results), a question that merits thorough inspection is "how explainable are these models?", which is what this paper aims to evaluate. In particular, we propose a common evaluation platform to systematically evaluate the explainability of any ranking model (the explanation algorithm being identical for all the models that are to be evaluated). In our proposed framework, each model, in addition to returning a ranked list of documents, is also required to return a list of explanation units, or rationales, for each document. This meta-information for each document is then used to measure how locally consistent these rationales are, as an intrinsic measure of interpretability, one that does not require manual relevance assessments. Additionally, as an extrinsic measure, we compute how relevant these rationales are by leveraging sub-document level relevance assessments. Our findings include a number of interesting observations: sentence-level rationales are more consistent, an increase in model complexity mostly leads to less consistent explanations, and interpretability measures offer a complementary dimension of evaluation of IR systems because consistency is not well correlated with nDCG at top ranks.
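The abstract does not define its metrics, so the following is only an illustrative sketch of how an extrinsic rationale-relevance score could be computed. It assumes each retrieved document carries a list of rationale identifiers (for example sentence ids) and that sub-document relevance assessments are available as sets of relevant unit ids. The names `ExplainedResult`, `rationale_precision`, and `mean_rationale_precision` are hypothetical, and the precision-style score is a stand-in for illustration, not the paper's actual measure.

```python
from dataclasses import dataclass, field


@dataclass
class ExplainedResult:
    """One ranked document plus the rationales returned for it (hypothetical type)."""
    doc_id: str
    score: float
    # Assumption: rationales are identified by unit ids, e.g. sentence ids.
    rationales: list[str] = field(default_factory=list)


def rationale_precision(result: ExplainedResult, relevant_units: set[str]) -> float:
    """Fraction of a document's rationales that are judged relevant."""
    if not result.rationales:
        return 0.0
    hits = sum(1 for unit in result.rationales if unit in relevant_units)
    return hits / len(result.rationales)


def mean_rationale_precision(
    ranked: list[ExplainedResult],
    unit_qrels: dict[str, set[str]],
    k: int = 10,
) -> float:
    """Average rationale precision over the top-k documents of one ranked list."""
    top = ranked[:k]
    if not top:
        return 0.0
    total = sum(
        rationale_precision(r, unit_qrels.get(r.doc_id, set())) for r in top
    )
    return total / len(top)


# Toy data, made up purely for illustration.
ranked = [
    ExplainedResult("d1", 2.0, ["d1:s1", "d1:s2"]),
    ExplainedResult("d2", 1.0, ["d2:s3"]),
]
unit_qrels = {"d1": {"d1:s1"}, "d2": {"d2:s3"}}
score = mean_rationale_precision(ranked, unit_qrels, k=2)
```

Because the score depends only on the rationales and the sub-document judgments, it can be computed for any ranker that exposes the same output structure, which is the kind of model-agnostic comparison the abstract describes.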

