Multilingual retrieval increasingly underpins cross-lingual question answering and retrieval-augmented generation. Strong zero-shot scores on multilingual benchmarks are often taken as evidence that current encoders transfer reliably across many languages. We argue that this assumption breaks down for underrepresented, morphologically rich languages, and use Amharic as a diagnostic case. Under a shared passage retrieval protocol covering dense, late-interaction, learned sparse, and cross-encoder paradigms, we compare zero-shot multilingual retrievers, Amharic-fine-tuned multilingual retrievers, and monolingual Amharic retrievers. The strongest zero-shot multilingual retriever underperforms the strongest monolingual Amharic first-stage retriever by 23% relative MRR@10. Fine-tuning two recent multilingual embedding models on the same Amharic supervision yields 32-60% relative MRR@10 gains over zero-shot, but the best Amharic-fine-tuned multilingual model remains below the strongest monolingual Amharic retriever. These findings indicate that zero-shot multilingual retrieval is not a sufficient proxy for equitable information access in the LLM era: for underrepresented languages, retrieval must be evaluated and adapted in-language rather than inferred from aggregate multilingual benchmarks. To foster future research, we publicly release the dataset, codebase, and trained models at https://github.com/rasyosef/amharic-neural-ir.
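The headline comparisons above are stated as relative differences in MRR@10. As a minimal sketch of how such a figure can be computed, the snippet below implements MRR@10 and a relative change between two scores. The function names (`mrr_at_k`, `relative_change`), the toy rankings, and the choice of the reference system's score as the denominator are illustrative assumptions. They are not taken from the released codebase, and the numbers are not results from the paper.

```python
from typing import Dict, List, Set


def mrr_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant passage in the top k.

    A query with no relevant passage in its top k contributes 0.
    """
    if not rankings:
        raise ValueError("rankings must not be empty")
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        gold = relevant.get(query_id, set())
        for rank, passage_id in enumerate(ranked_ids[:k], start=1):
            if passage_id in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


def relative_change(score: float, reference: float) -> float:
    """Signed change of `score` relative to `reference` (assumed denominator)."""
    if reference == 0:
        raise ValueError("reference score must be non-zero")
    return (score - reference) / reference


# Toy data (hypothetical): three queries, one relevant passage each.
toy_rankings = {
    "q1": ["p3", "p1", "p7"],  # relevant passage at rank 2
    "q2": ["p9", "p2"],        # relevant passage at rank 1
    "q3": ["p5", "p6"],        # relevant passage not retrieved
}
toy_relevant = {"q1": {"p1"}, "q2": {"p9"}, "q3": {"p0"}}

toy_mrr = mrr_at_k(toy_rankings, toy_relevant, k=10)  # (1/2 + 1 + 0) / 3
toy_gap = relative_change(toy_mrr, reference=0.625)   # versus a made-up reference
```

Under this convention, a negative value means the first system scores below the reference. A statement such as "underperforms by 23% relative MRR@10" would correspond to a value of about -0.23 if the stronger system is taken as the reference.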