This paper introduces a cross-lingual statutory article retrieval (SAR) dataset designed to enhance legal information retrieval in multilingual settings. Our dataset features spoken-language-style legal inquiries in English, paired with corresponding Chinese versions and relevant statutes, covering all Taiwanese civil, criminal, and administrative laws. This dataset aims to improve access to legal information for non-native speakers, particularly for foreign nationals in Taiwan. We propose several LLM-based methods as baselines for evaluating retrieval effectiveness, focusing on mitigating translation errors and improving cross-lingual retrieval performance. Our work provides a valuable resource for developing inclusive legal information retrieval systems.