Legal information retrieval in Portuguese remains difficult to evaluate systematically because available datasets differ widely in document type, query style, and relevance definition. We present JUÁ, a public benchmark for Brazilian legal retrieval designed to support more reproducible and comparable evaluation across heterogeneous legal collections. More broadly, JUÁ is intended not only as a benchmark, but as a continuous evaluation infrastructure for Brazilian legal IR, combining shared protocols, common ranking metrics, fixed splits when applicable, and a public leaderboard. The benchmark covers jurisprudence retrieval as well as broader legislative, regulatory, and question-driven legal search. We evaluate lexical, dense, and BM25-based reranking pipelines, including a domain-adapted Qwen embedding model fine-tuned on JUÁ-aligned supervision. Results show that the benchmark is sufficiently heterogeneous to distinguish retrieval paradigms and reveal substantial cross-dataset trade-offs. Domain adaptation yields its clearest gains on the supervision-aligned JUÁ-Juris subset, while BM25 remains highly competitive on other collections, especially in settings with strong lexical and institutional phrasing cues. Overall, JUÁ provides a practical evaluation framework for studying legal retrieval across multiple Brazilian legal domains under a common benchmark design.