Deterministic Fuzzy Triage for Legal Compliance Classification and Evidence Retrieval

From arXiv. 8 pages, 5 figures. Published in the Proceedings of the AAAI Bridge between Artificial Intelligence and Law 2026 (Full papers), pages 51-58.

Legal teams increasingly use machine learning to triage large volumes of contractual evidence, but many models are opaque, non-deterministic, and difficult to align with frameworks such as HIPAA or NERC-CIP. We study a simple, reproducible alternative based on deterministic dual encoders and transparent fuzzy triage bands. We train a RoBERTa-base dual encoder with a 512-dimensional projection and cosine similarity on the ACORD benchmark for graded clause retrieval, then fine-tune it on a CUAD-derived binary compliance dataset. Across five random seeds (40-44) on a single NVIDIA A100 GPU, the model achieves ACORD-style retrieval performance of NDCG@5 0.38-0.42, NDCG@10 0.45-0.50, and 4-star Precision@5 about 0.37 on the test split. On CUAD-derived binary labels, it achieves AUC 0.98-0.99 and F1 0.22-0.30 depending on positive-class weighting, outperforming majority and random baselines in a highly imbalanced setting with a positive rate of about 0.6%. We then map scalar compliance scores into three regions: auto-noncompliant, auto-compliant, and human-review. Thresholds are tuned on validation data to maximize automatic decision coverage subject to an empirical error-rate constraint of at most 2% over auto-decided examples. The result is a seed-stable system summarized by a small number of scalar parameters. We argue that deterministic encoders, calibrated fuzzy bands, and explicit error constraints provide a practical middle ground between hand-crafted rules and opaque large language models, supporting explainable evidence triage, reproducible audit trails, and concrete mappings to legal review concepts.
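
The threshold-tuning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `triage` and `tune_thresholds`, the convention that a higher score means "more likely compliant", the label encoding (1 = compliant, 0 = noncompliant), and the exhaustive grid search over observed score values are all assumptions made for illustration. The sketch picks a lower and an upper threshold on validation data so that the fraction of automatically decided examples is as large as possible while the empirical error rate over those examples stays at or below a cap (2% in the abstract).

```python
from bisect import bisect_left, bisect_right


def triage(score, t_low, t_high):
    """Map a scalar score to one of three bands (assumed: higher = more compliant)."""
    if score <= t_low:
        return "auto_noncompliant"
    if score >= t_high:
        return "auto_compliant"
    return "human_review"


def tune_thresholds(scores, labels, max_error=0.02):
    """Grid-search (t_low, t_high) for maximum auto-decision coverage
    subject to an empirical error cap over auto-decided examples.

    labels: 1 = compliant, 0 = noncompliant (assumed convention).
    Returns (t_low, t_high, coverage, error_rate).
    """
    n = len(scores)
    pairs = sorted(zip(scores, labels))
    sorted_scores = [s for s, _ in pairs]

    # pos_prefix[k] = number of compliant examples among the k lowest scores.
    pos_prefix = [0]
    for _, y in pairs:
        pos_prefix.append(pos_prefix[-1] + y)
    total_pos = pos_prefix[-1]

    # -inf / +inf let either automatic band be empty.
    candidates = [float("-inf")] + sorted(set(scores)) + [float("inf")]
    best = (float("-inf"), float("inf"), 0.0, 0.0)  # everything goes to review

    for t_low in candidates:
        k_low = bisect_right(sorted_scores, t_low)   # examples with score <= t_low
        low_errors = pos_prefix[k_low]               # compliant but auto-rejected
        for t_high in candidates:
            if t_high <= t_low:
                continue                             # bands must not overlap
            k_high = bisect_left(sorted_scores, t_high)  # examples with score < t_high
            n_high = n - k_high                      # examples with score >= t_high
            high_pos = total_pos - pos_prefix[k_high]
            high_errors = n_high - high_pos          # noncompliant but auto-accepted
            decided = k_low + n_high
            if decided == 0:
                continue
            error_rate = (low_errors + high_errors) / decided
            coverage = decided / n
            if error_rate <= max_error and coverage > best[2]:
                best = (t_low, t_high, coverage, error_rate)
    return best
```

In use, `tune_thresholds` would be called once on validation scores and labels, and the two returned thresholds then applied unchanged to new examples through `triage`. The search is quadratic in the number of distinct scores, which is acceptable for a sketch but would likely need a coarser grid on a large validation set.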
