This article presents a comprehensive evaluation of seven off-the-shelf document retrieval models (SPLADE, PLAID, PLAID-X, SimCSE, Contriever, OpenAI ADA, and Gemma2) on the Czech retrieval dataset DaReCzech. The primary objective of our experiments is to estimate the quality of modern retrieval approaches for the Czech language. Our analyses cover retrieval quality, speed, and memory footprint. We also analyze whether it is better to apply a model directly to Czech text or to machine-translate the text into English and then perform retrieval in English. Our experiments identify the most effective option for Czech information retrieval. The findings reveal notable performance differences among the models, with Gemma2 achieving the highest precision and recall, while Contriever performs poorly. Overall, the SPLADE and PLAID models offer a balance of efficiency and performance.