To Case or Not to Case: An Empirical Study in Learned Sparse Retrieval

From arXiv. This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in ECIR 2026 (Part I), Advances in Information Retrieval.

Learned Sparse Retrieval (LSR) methods construct sparse lexical representations of queries and documents that can be efficiently searched using inverted indexes. Existing LSR approaches have relied almost exclusively on uncased backbone models, whose vocabularies exclude case-sensitive distinctions, thereby reducing vocabulary mismatch. However, the most recent state-of-the-art language models are only available in cased versions. Despite this shift, the impact of backbone model casing on LSR has not been studied, potentially posing a risk to the viability of the method going forward. To fill this gap, we systematically evaluate paired cased and uncased versions of the same backbone models across multiple datasets to assess their suitability for LSR. Our findings show that, by default, LSR models built on cased backbones perform substantially worse than their uncased counterparts; however, this gap can be eliminated by lowercasing the text during pre-processing. Moreover, our token-level analysis reveals that, under lowercasing, cased models almost entirely suppress cased vocabulary items and behave effectively as uncased models, which explains their restored performance. This result broadens the applicability of recent cased models to the LSR setting and facilitates the integration of stronger backbone architectures into sparse retrieval. The complete code and implementation for this project are available at: https://github.com/lionisakis/Uncased-vs-cased-models-in-LSR
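The two mechanisms named in the abstract, lowercasing as a pre-processing step and measuring how much weight a representation places on cased vocabulary items, can be illustrated with a small sketch. This is not the authors' implementation: the whitespace tokenizer, the term-count weights, and the function names (`toy_sparse_rep`, `cased_mass`, `overlap`) are hypothetical stand-ins for a learned encoder, chosen only to show where the lowercasing step sits and why case differences cause vocabulary mismatch.

```python
from collections import Counter


def is_cased_token(token: str) -> bool:
    """True if the vocabulary item contains at least one uppercase character."""
    return any(ch.isupper() for ch in token)


def toy_sparse_rep(text: str, lowercase: bool = False) -> Counter:
    """Toy stand-in for an LSR encoder (assumption, not a real model).

    Tokens are whitespace-separated words and weights are raw term counts.
    A real LSR model would emit learned weights over a subword vocabulary.
    """
    if lowercase:
        text = text.lower()  # the pre-processing step discussed in the abstract
    return Counter(text.split())


def cased_mass(rep: Counter) -> float:
    """Fraction of the total weight assigned to cased vocabulary items."""
    total = sum(rep.values())
    if total == 0:
        return 0.0
    cased = sum(w for tok, w in rep.items() if is_cased_token(tok))
    return cased / total


def overlap(query_rep: Counter, doc_rep: Counter) -> float:
    """Dot product over shared tokens, mirroring inverted-index scoring."""
    return float(sum(w * doc_rep[tok] for tok, w in query_rep.items() if tok in doc_rep))


query = "apple pie recipe"
doc = "Apple Pie Recipe for beginners"

raw_score = overlap(toy_sparse_rep(query), toy_sparse_rep(doc))
lower_score = overlap(
    toy_sparse_rep(query, lowercase=True), toy_sparse_rep(doc, lowercase=True)
)
raw_cased = cased_mass(toy_sparse_rep(doc))
lower_cased = cased_mass(toy_sparse_rep(doc, lowercase=True))
```

In this toy setting the raw query and document share no tokens because "Apple" and "apple" are distinct vocabulary items, while the lowercased pair matches on all three query terms, and the share of weight on cased items drops to zero after lowercasing. With a real cased backbone the effect is learned rather than guaranteed, which is what the paper's token-level analysis examines.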
