Learned Sparse Retrieval (LSR) combines the efficiency of bi-encoders with the transparency of lexical matching, but existing approaches struggle to scale beyond English. We introduce MILCO, an LSR architecture that maps queries and documents from different languages into a shared English lexical space via a multilingual connector. MILCO is trained with a specialized two-stage regime that combines Sparse Alignment Pretraining with contrastive training to provide representation transparency and effectiveness while mitigating semantic collapse. Motivated by the observation that uncommon entities are often lost when projected into English, we propose a new LexEcho head, which enhances robustness by augmenting the English lexical representation with a source-language view obtained through a special [ECHO] token. MILCO achieves state-of-the-art multilingual and cross-lingual LSR performance, outperforming leading dense, sparse, and multi-vector baselines such as BGE-M3 and Qwen3-Embed on standard multilingual benchmarks, while supporting dynamic efficiency through post-hoc pruning. Notably, when using mass-based pruning to reduce document representations to only 30 active dimensions on average, MILCO 560M outperforms the similarly-sized Qwen3-Embed 0.6B with 1024 dimensions, while achieving 3$\times$ lower retrieval latency and 10$\times$ smaller index size.
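The abstract does not define mass-based pruning, so the following is only a minimal sketch under one plausible reading: keep the highest-weight vocabulary terms of a sparse representation until they account for a chosen fraction of its total weight. The function names `mass_prune` and `sparse_dot`, the `mass` parameter, and the toy term weights are hypothetical illustrations, not MILCO's implementation or its reported settings.

```python
from typing import Dict


def mass_prune(weights: Dict[str, float], mass: float = 0.9) -> Dict[str, float]:
    """Keep the highest-weight terms until they cover `mass` of the total weight.

    Assumption: "mass" is the sum of the (non-negative) term weights. The paper
    may define it differently.
    """
    if not 0.0 < mass <= 1.0:
        raise ValueError("mass must be in (0, 1]")
    total = sum(weights.values())
    if total <= 0.0:
        return {}
    kept: Dict[str, float] = {}
    running = 0.0
    # Visit terms from heaviest to lightest and stop once the target is covered.
    for term, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        if running >= mass * total:
            break
        kept[term] = weight
        running += weight
    return kept


def sparse_dot(query: Dict[str, float], doc: Dict[str, float]) -> float:
    """Score a query against a document as a dot product over shared terms."""
    small, large = (query, doc) if len(query) <= len(doc) else (doc, query)
    return sum(w * large[t] for t, w in small.items() if t in large)


# Toy document representation in an English lexical space (made-up weights).
doc = {"paris": 4.0, "france": 3.0, "capital": 2.0, "city": 1.0}
query = {"paris": 1.0, "city": 2.0}

pruned = mass_prune(doc, mass=0.5)   # keeps "paris" and "france"
full_score = sparse_dot(query, doc)       # 4.0 + 2.0 = 6.0
pruned_score = sparse_dot(query, pruned)  # 4.0, since "city" was pruned
```

Because pruning is applied after encoding, the same model can serve several index sizes: a lower `mass` yields fewer active dimensions per document and a smaller inverted index, at the cost of dropping low-weight terms that some queries may still match.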