The zero-shot effectiveness of neural retrieval models is often evaluated on the BEIR benchmark, a collection of different IR evaluation datasets. Previous studies found that on the BEIR subset Touch\'e 2020, an argument retrieval task, neural retrieval models are considerably less effective than BM25. So far, however, no study has investigated what makes argument retrieval so "special". To analyze the potential limits of neural retrieval models in more depth, we run a reproducibility study on the Touch\'e 2020 data. Our study comprises two experiments: (i) a black-box evaluation (i.e., no model retraining) that incorporates a theoretical exploration using retrieval axioms, and (ii) a data denoising evaluation involving post-hoc relevance judgments. The black-box evaluation reveals an inherent bias of neural models towards retrieving short passages from the Touch\'e 2020 data; it also shows that quite a few of the neural models' results are unjudged in the Touch\'e 2020 data. Because many of the short Touch\'e passages are not argumentative and thus non-relevant per se, and because the missing judgments complicate a fair comparison, we denoise the Touch\'e 2020 data by excluding very short passages (fewer than 20 words) and by adding post-hoc judgments for the previously unjudged results, following the Touch\'e guidelines. On the denoised data, the effectiveness of the neural models improves by up to 0.52 in nDCG@10, but BM25 is still more effective. Our code and the augmented Touch\'e 2020 dataset are available at \url{https://github.com/castorini/touche-error-analysis}.
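The denoising and evaluation steps described above can be illustrated with a minimal Python sketch. It is not the code from the linked repository. The function names, the whitespace-based word count, the linear-gain nDCG formulation, and the assumption of non-negative relevance grades are all illustrative choices; only the 20-word threshold and the nDCG@10 cutoff come from the abstract.

```python
import math

# Threshold from the abstract: passages with fewer than 20 words are excluded.
MIN_WORDS = 20


def denoise_corpus(corpus, min_words=MIN_WORDS):
    """Keep passages with at least `min_words` whitespace-separated tokens.

    `corpus` maps doc_id -> passage text. Whitespace tokenization is an
    assumption; the original study may count words differently.
    """
    return {doc_id: text for doc_id, text in corpus.items()
            if len(text.split()) >= min_words}


def filter_run(ranking, kept_ids):
    """Drop removed passages from a ranked list of doc_ids, preserving order."""
    return [doc_id for doc_id in ranking if doc_id in kept_ids]


def unjudged_at_k(ranking, qrels, k=10):
    """Fraction of the top-k results that have no relevance judgment."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id not in qrels) / len(top)


def ndcg_at_k(ranking, qrels, k=10):
    """nDCG@k with linear gain; unjudged documents count as non-relevant.

    `qrels` maps doc_id -> non-negative relevance grade for one query.
    """
    dcg = sum(qrels.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranking[:k]))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(grade / math.log2(rank + 2) for rank, grade in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

In this sketch, a run would first be passed through `filter_run` with the ids returned by `denoise_corpus`, and the judgments would be restricted to the same ids, before calling `ndcg_at_k`. Comparing `unjudged_at_k` before and after adding post-hoc judgments indicates how much of a model's top-10 was previously scored as non-relevant only because it was unjudged.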