This article presents a comprehensive evaluation of seven off-the-shelf document retrieval models (SPLADE, PLAID, PLAID-X, SimCSE, Contriever, OpenAI ADA, and Gemma2) on the Czech retrieval dataset DaReCzech. The primary objective of our experiments is to estimate the quality of modern retrieval approaches in the Czech language; our analyses cover retrieval quality, speed, and memory footprint. The secondary objective is to determine whether it is better to apply a model directly to Czech text or to machine-translate the text into English and then perform retrieval in English. Our experiments identify the most effective option for Czech information retrieval. The findings reveal notable performance differences among the models: Gemma2 achieves the highest precision and recall, while Contriever performs poorly. Overall, the SPLADE and PLAID models offer a balance of efficiency and performance.