Multinational companies increasingly require cross-jurisdictional contract review, yet existing legal NLP datasets are largely restricted to a single jurisdiction. We introduce LAUKIN (Legal equivalence dataset of Australia, UK, and INdia), a dataset of clause pairs (AU-UK, UK-IN, IN-AU) labelled for binary legal equivalence. We develop a novel multi-stage retrieval and reranking pipeline to construct the initial clause pair mapping; legal experts then annotate a subset of these pairs as Equivalent or Not Equivalent. The dataset comprises 14,727 clause pairs from 204 contracts across 8 agreement types, of which 3,000 are manually labelled: 900 train, 600 dev, and 1,500 test. We evaluate 12 models across 4 techniques; the best reaches a macro-F1 of 65.11%, which establishes LAUKIN as a challenging benchmark. The results show that, despite a shared legal heritage, drafting conventions diverge significantly across jurisdictions, making cross-jurisdictional equivalence classification non-trivial. LAUKIN also includes 11,727 unlabelled training pairs to support future semi-supervised learning research in legal NLP.
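For readers unfamiliar with the headline metric, macro-F1 is the unweighted mean of the per-class F1 scores, so the Equivalent and Not Equivalent classes count equally regardless of how many pairs fall in each. The sketch below is a minimal illustration of that standard definition and is not the evaluation script used for LAUKIN; the function names and the toy labels are hypothetical.

```python
from typing import Sequence

# The two classes named in the abstract.
LABELS = ("Equivalent", "Not Equivalent")


def f1_for_label(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 for one class, treating `label` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0.0 when the class never appears.
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of the per-class F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return sum(f1_for_label(gold, pred, lab) for lab in labels) / len(labels)


if __name__ == "__main__":
    # Toy labels for illustration only; these are not LAUKIN data.
    gold = ["Equivalent", "Equivalent", "Not Equivalent", "Not Equivalent"]
    pred = ["Equivalent", "Not Equivalent", "Not Equivalent", "Not Equivalent"]
    print(f"macro-F1 = {macro_f1(gold, pred):.4f}")
```

Because each class contributes equally, a classifier that always predicts the majority class scores poorly on macro-F1 even when its accuracy looks acceptable.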