Statistical analysis of a large-scale legal corpus can provide valuable legal insights. Such analysis requires one to (1) select a subset of the corpus using document retrieval tools, (2) structure the text using information extraction (IE) systems, and (3) visualize the data for statistical analysis. Each step demands either specialized tools or programming skills, and no comprehensive, unified "no-code" tool has been available. For IE in particular, if the target information is not predefined in the ontology of the IE system, users must build their own system. Here we present NESTLE, a no-code tool for large-scale statistical analysis of legal corpora. With NESTLE, users can search for target documents, extract information, and visualize the structured data, all via a chat interface with an accompanying auxiliary GUI for fine-grained control. NESTLE consists of three main components: a search engine, an end-to-end IE system, and a Large Language Model (LLM) that glues the components together and provides the chat interface. Powered by the LLM and the end-to-end IE system, NESTLE can extract any type of information, including types not predefined in the IE system, opening up the possibility of unlimited, customizable statistical analysis of the corpus without writing a single line of code. The use of a custom end-to-end IE system also enables faster and lower-cost IE on large-scale corpora. We validate our system on 15 Korean precedent IE tasks and 3 legal text classification tasks from LEXGLUE. Comprehensive experiments reveal that NESTLE can achieve performance comparable to GPT-4 by training the internal IE module with 4 human-labeled and 192 LLM-labeled examples. A detailed analysis provides insight into the trade-off between accuracy, time, and cost in building such a system.
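The three-step workflow described in the abstract (retrieve, extract, aggregate for visualization) can be illustrated with a minimal sketch. This is not NESTLE's implementation: the toy corpus, the function names `search`, `extract_fine`, and `run_pipeline`, and the regex-based extractor are all hypothetical stand-ins chosen for illustration. In particular, the regex takes the place of the trained end-to-end IE module, and a keyword filter takes the place of the search engine.

```python
import re
from collections import Counter
from typing import Callable, List, Optional

# Toy corpus; these documents are invented for illustration only.
CORPUS: List[str] = [
    "Defendant was convicted of fraud and fined 500 dollars.",
    "Defendant was convicted of assault and fined 300 dollars.",
    "The appeal concerning a fraud charge was dismissed.",
    "Defendant was convicted of fraud and fined 500 dollars again.",
]


def search(corpus: List[str], keyword: str) -> List[str]:
    """Step 1: select the documents that mention the keyword (case-insensitive)."""
    return [doc for doc in corpus if keyword.lower() in doc.lower()]


def extract_fine(doc: str) -> Optional[int]:
    """Step 2: hypothetical extractor; a regex stands in for a trained IE model."""
    match = re.search(r"fined (\d+) dollars", doc)
    return int(match.group(1)) if match else None


def run_pipeline(
    corpus: List[str],
    keyword: str,
    extractor: Callable[[str], Optional[int]],
) -> Counter:
    """Step 3: aggregate extracted values into counts that a chart could display."""
    values = [extractor(doc) for doc in search(corpus, keyword)]
    # Documents where nothing was extracted are dropped from the statistics.
    return Counter(v for v in values if v is not None)


fine_counts = run_pipeline(CORPUS, "fraud", extract_fine)
print(fine_counts)
```

Passing the extractor as an argument mirrors the idea that the target field is not fixed in advance: swapping in a different extractor changes what is counted without touching the retrieval or aggregation steps.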