Systematic Evaluation of Neural Retrieval Models on the Touché 2020 Argument Retrieval Subset of BEIR

The zero-shot effectiveness of neural retrieval models is often evaluated on the BEIR benchmark, a combination of different IR evaluation datasets. Interestingly, previous studies found that on the BEIR subset Touché 2020, an argument retrieval task, neural retrieval models are considerably less effective than BM25. So far, however, no further investigation has been conducted into what makes argument retrieval so "special". To analyze the potential limits of neural retrieval models more deeply, we run a reproducibility study on the Touché 2020 data. In our study, we focus on two experiments: (i) a black-box evaluation (i.e., no model retraining), incorporating a theoretical exploration using retrieval axioms, and (ii) a data denoising evaluation involving post-hoc relevance judgments. Our black-box evaluation reveals an inherent bias of neural models towards retrieving short passages from the Touché 2020 data, and we also find that quite a few of the neural models' results are unjudged in the Touché 2020 data. As many of the short Touché passages are not argumentative and thus non-relevant per se, and as the missing judgments complicate fair comparison, we denoise the Touché 2020 data by excluding very short passages (fewer than 20 words) and by augmenting the unjudged data with post-hoc judgments following the Touché guidelines. On the denoised data, the effectiveness of the neural models improves by up to 0.52 in nDCG@10, but BM25 is still more effective. Our code and the augmented Touché 2020 dataset are available at https://github.com/castorini/touche-error-analysis.
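The denoising step and the nDCG@10 metric mentioned above can be illustrated with a minimal sketch. This is not the authors' code (their implementation is in the linked repository); the function names are hypothetical, word counts are assumed to use plain whitespace tokenization, nDCG is assumed to use linear gains, and unjudged passages are treated as non-relevant, which is one reason missing judgments can penalize a system.

```python
import math


def denoise_corpus(corpus, min_words=20):
    """Keep passages with at least `min_words` whitespace-separated tokens.

    `corpus` maps passage id -> text. Whitespace tokenization is an
    assumption; the original study may count words differently.
    """
    return {
        doc_id: text
        for doc_id, text in corpus.items()
        if len(text.split()) >= min_words
    }


def filter_ranking(ranking, kept_ids):
    """Drop removed passages from a ranked list, preserving the order."""
    return [doc_id for doc_id in ranking if doc_id in kept_ids]


def dcg(gains):
    """Discounted cumulative gain with linear gains (an assumption)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranking, qrels, k=10):
    """nDCG@k for one query.

    `qrels` maps passage id -> graded relevance. Passages without a
    judgment count as gain 0, i.e. unjudged is treated as non-relevant.
    """
    gains = [max(qrels.get(doc_id, 0), 0) for doc_id in ranking[:k]]
    ideal = sorted((max(g, 0) for g in qrels.values()), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    corpus = {
        "d1": "I agree.",
        "d2": " ".join(["argument"] * 25),
        "d3": " ".join(["evidence"] * 30),
    }
    qrels = {"d2": 2, "d3": 1}
    ranking = ["d1", "d3", "d2"]

    kept = denoise_corpus(corpus)
    print(ndcg_at_k(ranking, qrels))
    print(ndcg_at_k(filter_ranking(ranking, kept), qrels))
```

In the toy example, the short passage "d1" occupies the top rank without being relevant, so removing it moves the judged passages up and raises the score. This mirrors, in a very simplified way, why excluding very short passages can change measured effectiveness.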

