LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval

Large language models (LLMs) are increasingly used to access legal information. Yet their deployment in multilingual legal settings is constrained by unreliable retrieval and by the lack of domain-adapted, open embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We quantify the fidelity of PDF-to-text conversion by measuring lexical consistency against the authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models with contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and high-resource languages show that legal-domain fine-tuning consistently improves Top-k retrieval accuracy over strong baselines, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show that these improvements transfer to languages unseen during fine-tuning, indicating that fine-tuning primarily enhances language-independent, content-level legal representations rather than language-specific cues. We publish code\footnote{\href{https://github.com/nargesbh/eur_lex}{GitHub Repository}} and data\footnote{\href{https://huggingface.co/datasets/G4KMU/LEMUR}{Hugging Face Dataset}}.
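
For readers unfamiliar with the metric, Top-k retrieval accuracy can be illustrated with a minimal sketch. It assumes each query has exactly one gold document and that candidates are ranked by cosine similarity between embeddings. The function names, the toy vectors, and the single-gold-document setup are illustrative assumptions, not the evaluation protocol or code released with LEMUR.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_accuracy(query_vecs, doc_vecs, gold_idx, k):
    """Fraction of queries whose gold document appears among the k most similar documents.

    query_vecs: list of query embeddings
    doc_vecs:   list of document embeddings (the retrieval pool)
    gold_idx:   gold_idx[i] is the index in doc_vecs of the correct document for query i
    """
    hits = 0
    for q, gold in zip(query_vecs, gold_idx):
        scores = [cosine(q, d) for d in doc_vecs]
        # Rank document indices from most to least similar.
        ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy 2-dimensional "embeddings" (hypothetical values, for illustration only).
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.6, 0.8]]
gold = [0, 1]

top1 = top_k_accuracy(queries, docs, gold, k=1)
top2 = top_k_accuracy(queries, docs, gold, k=2)
print(top1, top2)
```

In this toy pool the second query ranks its gold document second, so it counts as a miss at k=1 and a hit at k=2. In a real setting the vectors would come from the embedding model under evaluation, before and after fine-tuning.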
