Sparse encoders offer high-precision retrieval by representing term importance within a vocabulary space, yet their English-centric vocabularies are a critical impediment to transfer to non-English languages. To overcome this structural limitation, we propose SemBridge, a novel embedding initialization method for cross-lingual adaptation of sparse encoders. SemBridge establishes semantic alignments between source and target vocabularies using multilingual dense embeddings as a bridge. Rather than relying on all source tokens, SemBridge selects a small set of semantically related source-language tokens and uses them to initialize each target-language token. This filters out semantic noise and reconstructs each target token as a linear combination of its closest source-language synonyms, which accelerates convergence during fine-tuning and improves training efficiency. Extensive experiments across five languages and four sparse architectures demonstrate that, compared to existing baselines, SemBridge achieves superior zero-shot retrieval performance and consistently better retrieval performance after fine-tuning. These results validate SemBridge as a practical solution for deploying high-performance sparse retrieval systems in diverse linguistic environments.
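The initialization step described in the abstract can be illustrated with a minimal sketch. The abstract states only that each target token is initialized as a linear combination of a small set of semantically related source tokens, aligned through multilingual dense embeddings. The function name `init_target_embeddings`, the use of cosine similarity, the top-`k` selection, the softmax weighting, and the `temperature` parameter are all assumptions made for illustration and may differ from the actual SemBridge procedure.

```python
import numpy as np


def init_target_embeddings(src_bridge, tgt_bridge, src_emb, k=3, temperature=0.1):
    """Initialize target-token embeddings from a few related source tokens.

    src_bridge: (n_src, d_bridge) bridge-model embeddings of source tokens
    tgt_bridge: (n_tgt, d_bridge) bridge-model embeddings of target tokens
    src_emb:    (n_src, d_model)  source-token embeddings of the sparse encoder
    k, temperature: hypothetical hyperparameters (not taken from the paper)
    """
    # Cosine similarity between every target token and every source token.
    src_n = src_bridge / np.linalg.norm(src_bridge, axis=1, keepdims=True)
    tgt_n = tgt_bridge / np.linalg.norm(tgt_bridge, axis=1, keepdims=True)
    sim = tgt_n @ src_n.T  # shape (n_tgt, n_src)

    # Keep only the k most similar source tokens per target token.
    top_idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    top_sim = np.take_along_axis(sim, top_idx, axis=1)

    # Assumed weighting: softmax over the selected similarities.
    logits = top_sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each target embedding is a convex combination of k source embeddings.
    tgt_emb = np.einsum("tk,tkd->td", weights, src_emb[top_idx])
    return tgt_emb, top_idx, weights
```

Restricting the combination to `k` neighbours, instead of weighting the entire source vocabulary, is what corresponds to the noise filtering the abstract describes: distant source tokens contribute nothing to the initialized embedding.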