The emergence of Large Language Models (LLMs) has enabled new methods in Information Retrieval in which relevance is estimated directly through language understanding and reasoning rather than through embedding similarity. We argue that similarity is a short-sighted interpretation of relevance, and that LLM-Based Relevance Judgment Systems (LLM-RJS), particularly those with reasoning, have the potential to outperform Neural Embedding Retrieval Systems (NERS) by overcoming this limitation. Using the TREC-DL 2019 passage retrieval dataset, we compare several LLM-RJS against NERS, but observe no noticeable improvement. We then analyze the impact of reasoning by comparing LLM-RJS with and without it. We find that the human annotations themselves suffer from short-sightedness, and that the false positives produced by reasoning LLM-RJS are primarily annotation mistakes caused by that short-sightedness. We conclude that LLM-RJS do have the ability to address the short-sightedness limitation of NERS, but that this ability cannot be evaluated with standard annotated relevance datasets.
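To make the contrast between the two paradigms concrete, the following is a minimal sketch, not the paper's actual setup: `embed` is a stand-in for a trained encoder, `call_llm` for an LLM client, and the prompt wording is hypothetical (the 0-3 grading scale mirrors TREC-DL's graded relevance labels).

```python
# Sketch of the two relevance-estimation paradigms compared in the paper.
# `embed` and `call_llm` are hypothetical placeholders, not real APIs.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real NERS would use a trained bi-encoder here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def ners_score(query: str, passage: str) -> float:
    # NERS: relevance is approximated by cosine similarity of embeddings.
    q, p = embed(query), embed(passage)
    return float(q @ p / (np.linalg.norm(q) * np.linalg.norm(p)))

JUDGE_PROMPT = """You are a relevance judge.
Query: {query}
Passage: {passage}
Explain step by step whether the passage answers the query, then output
a single grade from 0 (irrelevant) to 3 (perfectly relevant)."""

def llm_rjs_score(query: str, passage: str, call_llm) -> int:
    # LLM-RJS: relevance is judged directly via language understanding
    # and (optionally) explicit reasoning, not via vector similarity.
    reply = call_llm(JUDGE_PROMPT.format(query=query, passage=passage))
    return int(reply.strip().split()[-1])  # assume the grade is the last token
```

A NERS ranks by `ners_score` alone, so a passage whose relevance is not reflected in surface similarity is missed; the LLM judge can, in principle, reason past that, which is the limitation the paper probes.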