STARD: A Chinese Statute Retrieval Dataset with Real Queries Issued by Non-professionals

Statute retrieval aims to find the statutory articles relevant to a given query. It underpins a wide range of legal applications, such as legal advice, automated judicial decisions, and legal document drafting. Existing statute retrieval benchmarks focus on formal, professional queries drawn from sources like bar exams and legal case documents, and thereby neglect non-professional queries from the general public, which often lack precise legal terminology and references. To address this gap, we introduce the STAtute Retrieval Dataset (STARD), a Chinese dataset comprising 1,543 query cases collected from real-world legal consultations and 55,348 candidate statutory articles. Unlike existing statute retrieval datasets, STARD captures the complexity and diversity of real queries from the general public. Through a comprehensive evaluation of various retrieval baselines, we show that existing retrieval approaches all fall short on these real queries issued by non-professional users. The best method achieves a Recall@100 of only 0.907, suggesting that further research is needed in this area. All code and datasets are available at: https://github.com/oneal2000/STARD/tree/main
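
For readers unfamiliar with the reported metric, the following is a minimal sketch of how Recall@k is commonly computed for a retrieval benchmark of this kind. It is not taken from the STARD repository: the function names, the article identifiers, and the choice to average recall over queries are assumptions made for illustration.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of a query's relevant articles found in the top-k ranked results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant article")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average Recall@k over queries (assumed here; the paper may aggregate differently)."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    if not scores:
        raise ValueError("no queries to evaluate")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical article IDs; real candidates would come from the statute corpus.
    runs = [
        (["art_12", "art_7", "art_99"], {"art_12", "art_99"}),
        (["art_3", "art_5"], {"art_3"}),
    ]
    print(mean_recall_at_k(runs, k=2))
```

Under this definition, a Recall@100 of 0.907 would mean that, on average, about 9% of the relevant statutory articles are still missing from the top 100 candidates returned for a query.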

