Environmental, Social, and Governance (ESG) KPIs assess an organization's performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult because both the structure and the content of these tables vary widely. We propose Statements, a novel domain-agnostic data structure for extracting quantitative facts and related information. We propose translating tables to statements as a new supervised deep-learning task for universal information extraction. We introduce SemTabNet, a dataset of over 100K annotated tables. Investigating a family of T5-based Statement Extraction Models, we find that our best model generates statements that are 82% similar to the ground truth (compared to a baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports. The homogeneous nature of statements permits exploratory data analysis on the expansive information found in large collections of ESG reports.
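To make the idea of a homogeneous record for quantitative facts more concrete, the following is a minimal illustrative sketch, not the Statements schema or the model proposed in this work. The `Fact` class, its field names, the `table_to_facts` helper, and the example table (including the company name and all figures) are hypothetical and chosen only for illustration. The sketch assumes a simple table layout in which the first two columns hold a metric label and a unit and the remaining columns hold yearly values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """One quantitative fact in a uniform shape (hypothetical fields)."""
    subject: str           # entity the fact is about, e.g. a company
    prop: str              # the measured property, e.g. "Scope 1 emissions"
    value: float           # numeric value
    unit: Optional[str]    # unit of measurement, if known
    year: Optional[int]    # reporting year, if known


def table_to_facts(subject: str, header: List[str], rows: List[List[str]]) -> List[Fact]:
    """Flatten a table with columns [label, unit, year1, year2, ...] into facts.

    Assumes every header cell after the first two is a year; cells that
    cannot be parsed as numbers are skipped.
    """
    years = [int(h) for h in header[2:]]
    facts = []
    for label, unit, *cells in rows:
        for year, cell in zip(years, cells):
            try:
                value = float(cell.replace(",", ""))
            except ValueError:
                continue  # non-numeric cell such as "n/a"
            facts.append(Fact(subject, label, value, unit, year))
    return facts


# Made-up example table, for illustration only.
HEADER = ["Metric", "Unit", "2021", "2022"]
ROWS = [
    ["Scope 1 emissions", "tCO2e", "1,200", "1,050"],
    ["Water consumption", "m3", "n/a", "48,000"],
]

facts = table_to_facts("ExampleCorp", HEADER, ROWS)
for fact in facts:
    print(fact)
```

Because every record has the same fields, facts drawn from differently shaped tables can be filtered, grouped, or aggregated with the same code, which is the property that makes exploratory analysis across many reports practical.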