Legal information retrieval in Portuguese remains difficult to evaluate systematically because available datasets differ widely in document type, query style, and relevance definition. We present \textsc{JUÁ}, a public benchmark for Brazilian legal retrieval designed to support more reproducible and comparable evaluation across heterogeneous legal collections. More broadly, \textsc{JUÁ} is intended not only as a benchmark but also as a continuous evaluation infrastructure for Brazilian legal IR, combining shared protocols, common ranking metrics, fixed splits where applicable, and a public leaderboard. The benchmark covers jurisprudence retrieval as well as broader legislative, regulatory, and question-driven legal search. We evaluate lexical and dense retrievers as well as BM25-based reranking pipelines, including a domain-adapted Qwen embedding model fine-tuned on \textsc{JUÁ}-aligned supervision. Results show that the benchmark is sufficiently heterogeneous to distinguish retrieval paradigms and to reveal substantial cross-dataset trade-offs. Domain adaptation yields its clearest gains on the supervision-aligned \textsc{JUÁ-Juris} subset, while BM25 remains highly competitive on the other collections, especially where queries and documents share strong lexical and institutional phrasing cues. Overall, \textsc{JUÁ} provides a practical framework for studying legal retrieval across multiple Brazilian legal domains under a common benchmark design.