Regulatory documents encode legally binding obligations that LLM-based systems must respect. Yet converting dense, hierarchically structured legal text into machine-readable rules remains a costly, expert-intensive process. We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, requiring no human annotation, domain-specific prompting, or gold-standard data. De Jure operates in four sequential stages: normalization of source documents into structured Markdown; LLM-driven semantic decomposition into structured rule units; multi-criteria LLM-as-a-judge evaluation across 19 dimensions spanning metadata, definitions, and rule semantics; and iterative repair of low-scoring extractions within a bounded regeneration budget, where upstream components are repaired before rule units are evaluated. We evaluate De Jure with four models on three regulatory corpora spanning finance, healthcare, and AI governance. In the finance domain, De Jure yields consistent, monotonic improvement in extraction quality, reaching peak performance within three judge-guided iterations. De Jure generalizes effectively to healthcare and AI governance, maintaining high performance across both open- and closed-source models. In a downstream compliance question-answering evaluation via RAG, responses grounded in De Jure-extracted rules are preferred over those from prior work in 73.8% of cases at single-rule retrieval depth, rising to 84.0% under broader retrieval, confirming that extraction fidelity translates directly into downstream utility. These results demonstrate that explicit, interpretable evaluation criteria can substitute for human annotation in complex regulatory domains, offering a scalable and auditable path toward regulation-grounded LLM alignment.
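The judge-guided repair stage described above can be illustrated with a minimal sketch. This is not the De Jure implementation: the function names, the score scale in [0, 1], the 0.75 threshold, the budget of 3, and the rule of keeping a candidate only when its mean score improves are all assumptions made for illustration. The LLM judge and the LLM repair step are replaced by toy stand-ins so that the control flow (upstream components first, bounded regeneration per component) can be run as-is.

```python
from statistics import mean
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]                      # criterion name -> score in [0, 1] (assumed scale)
Judge = Callable[[str, dict], Scores]          # (stage, extraction) -> per-criterion scores
Repair = Callable[[str, dict, Scores], dict]   # (stage, extraction, failing) -> new extraction

# Upstream components come first, so rule units are judged against repaired context.
STAGES = ("metadata", "definitions", "rules")


def repair_stage(stage: str, extraction: dict, judge: Judge, repair: Repair,
                 threshold: float, budget: int) -> Tuple[dict, Scores, int]:
    """Regenerate one component until all criteria pass or the budget is spent."""
    scores = judge(stage, extraction)
    used = 0
    while used < budget:
        failing = {name: s for name, s in scores.items() if s < threshold}
        if not failing:
            break
        candidate = repair(stage, extraction, failing)
        candidate_scores = judge(stage, candidate)
        used += 1
        # Assumed acceptance rule: keep the candidate only if it scores better on average.
        if mean(candidate_scores.values()) > mean(scores.values()):
            extraction, scores = candidate, candidate_scores
    return extraction, scores, used


def run_pipeline(extraction: dict, judge: Judge, repair: Repair,
                 threshold: float = 0.75, budget: int = 3):
    """Repair each component in upstream-first order and report scores and attempts."""
    report = {}
    for stage in STAGES:
        extraction, scores, used = repair_stage(
            stage, extraction, judge, repair, threshold, budget)
        report[stage] = (scores, used)
    return extraction, report


# Toy stand-ins: each component stores its own "quality"; a real system would call an LLM.
def toy_judge(stage: str, extraction: dict) -> Scores:
    return {f"{stage}_quality": extraction[stage]}


def toy_repair(stage: str, extraction: dict, failing: Scores) -> dict:
    fixed = dict(extraction)
    fixed[stage] = min(1.0, extraction[stage] + 0.25)
    return fixed


draft = {"metadata": 1.0, "definitions": 0.5, "rules": 0.0}
final, report = run_pipeline(draft, toy_judge, toy_repair)
for stage, (scores, used) in report.items():
    print(stage, scores, "repairs used:", used)
```

In this toy run the metadata component passes without repair, the definitions component needs one regeneration, and the rules component consumes the full budget. Bounding the loop per component keeps cost predictable even when the judge never returns a passing score.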