ExaRanker recently introduced an approach to training information retrieval (IR) models that incorporates natural language explanations as additional labels. The method addresses the challenge of limited labeled examples and improves the effectiveness of IR models. However, the initial results relied on proprietary language models such as GPT-3.5, whose cost and data privacy concerns constrained dataset size. In this paper, we introduce ExaRanker-Open, in which we adapt and explore open-source language models to generate explanations. We test the method with different LLMs and dataset sizes to better understand the effective contribution of data augmentation. Our findings show that incorporating explanations consistently enhances neural rankers, with benefits growing as the LLM size increases. Notably, the data augmentation method proves advantageous even with large datasets, as evidenced by ExaRanker surpassing the target baseline by 0.6 nDCG@10 points in our study. To encourage further advances by the research community, we have open-sourced both the code and datasets at https://github.com/unicamp-dl/ExaRanker.
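For readers unfamiliar with the reported metric, the following is a minimal sketch of how nDCG@10 can be computed for a single query. It is not taken from the ExaRanker code base: the function names `dcg_at_k` and `ndcg_at_k` are hypothetical, the sketch assumes linear gain (some evaluation tools use an exponential gain instead), and it takes the ideal ordering from the same list of judged documents rather than from the full set of relevance judgments for the query.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain, assumed)."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    # Simplification: the ideal ordering is built from the same relevance list.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the returned documents, in ranked order (illustrative values).
score = ndcg_at_k([0, 1, 0, 2, 0])
print(f"nDCG@10 = {score:.4f}")
```

Under the assumption that "points" refers to scores reported on a 0 to 100 scale, a gain of 0.6 points would correspond to a difference of 0.006 in the value returned by a function like this, averaged over all evaluation queries.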