Synthesizing Realistic Data for Table Recognition

To address the limitations of current automatic table annotation methods and of random table synthesis approaches, we propose a novel method for synthesizing annotated data for table recognition. The method reuses the structure and content of existing complex tables, which makes it efficient to create tables that closely replicate the authentic styles of the target domain. Using the actual structure and content of tables from Chinese financial announcements, we built the first large-scale table annotation dataset in this domain and used it to train several recent deep learning-based end-to-end table recognition models. We also established the first benchmark of real-world complex tables in the Chinese financial announcement domain and used it to evaluate models trained on our synthetic data, which validates the practicality and effectiveness of the method. Finally, we applied the synthesis method to augment the FinTabNet dataset, which is extracted from English financial announcements, by increasing the proportion of tables with multiple spanning cells to introduce greater complexity. Our experiments show that models trained on the augmented dataset improve across the board, especially on tables with multiple spanning cells.
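The abstract does not spell out the synthesis pipeline, so the following is only a minimal sketch of the general idea it describes: keep the cell layout of an existing table, including its spanning cells, and refill it with content sampled from a domain pool. The names `Cell`, `count_spanning_cells`, `resample_content`, and `to_html`, as well as the sample pool, are hypothetical and are not taken from the paper.

```python
import random
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Cell:
    """One table cell, anchored at (row, col), possibly spanning several rows or columns."""
    row: int
    col: int
    rowspan: int = 1
    colspan: int = 1
    text: str = ""


def count_spanning_cells(cells):
    """Number of cells that span more than one row or column."""
    return sum(1 for c in cells if c.rowspan > 1 or c.colspan > 1)


def resample_content(cells, pool, rng):
    """Keep the layout of every cell and replace its text with a sample from the pool."""
    return [Cell(c.row, c.col, c.rowspan, c.colspan, rng.choice(pool)) for c in cells]


def to_html(cells):
    """Serialize the cells as an HTML table, which also serves as the structure annotation."""
    n_rows = max(c.row + c.rowspan for c in cells)
    parts = ["<table>"]
    for r in range(n_rows):
        parts.append("<tr>")
        # Only cells anchored in this row are emitted; spans cover the rest.
        for c in sorted((c for c in cells if c.row == r), key=lambda c: c.col):
            attrs = ""
            if c.rowspan > 1:
                attrs += f' rowspan="{c.rowspan}"'
            if c.colspan > 1:
                attrs += f' colspan="{c.colspan}"'
            parts.append(f"<td{attrs}>{escape(c.text)}</td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


# Hypothetical template: a header spanning two columns above two data cells.
template = [
    Cell(0, 0, colspan=2, text="Revenue"),
    Cell(1, 0, text="2022"),
    Cell(1, 1, text="2023"),
]
# Hypothetical content pool; a real one would be harvested from the target domain.
pool = ["Net profit", "Total assets", "1,204.5"]

rng = random.Random(0)
synthetic = resample_content(template, pool, rng)
html = to_html(synthetic)
```

Under these assumptions, a helper such as `count_spanning_cells` could be used to select templates with several spanning cells when the goal is to raise their proportion in a dataset, as the abstract describes for FinTabNet.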

