We introduce RealKIE, a benchmark of five challenging datasets aimed at advancing key information extraction methods, with an emphasis on enterprise applications. The datasets cover a diverse range of documents: SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and Resource Contracts. Each presents unique challenges: poor text serialization, sparse annotations in long documents, and complex tabular layouts. These datasets provide a realistic testing ground for key information extraction in applications such as investment analysis and legal data processing. In addition to presenting these datasets, we offer an in-depth description of the annotation process, document processing techniques, and baseline modeling approaches. This contribution facilitates the development of NLP models capable of handling practical challenges and supports further research into information extraction technologies applicable to industry-specific problems. The annotated data and OCR outputs are available to download at https://indicodatasolutions.github.io/RealKIE/. Code to reproduce the baselines will be available shortly.