To overcome the limitations of current automatic table annotation methods and random table synthesis approaches, we propose a novel method for synthesizing annotated data for table recognition. The method reuses the structure and content of existing complex tables, making it possible to efficiently create tables that closely replicate the authentic styles of the target domain. Leveraging the actual structure and content of tables from Chinese financial announcements, we have developed the first extensive table annotation dataset in this domain and used it to train several recent deep learning-based end-to-end table recognition models. We have also established the first benchmark of real-world complex tables in the Chinese financial announcement domain and used it to assess models trained on our synthetic data, thereby validating the practicality and effectiveness of our method. Furthermore, we applied our synthesis method to augment the FinTabNet dataset, which was extracted from English financial announcements, by increasing the proportion of tables with multiple spanning cells to introduce greater complexity. Our experiments show that models trained on this augmented dataset achieve comprehensive performance improvements, especially in recognizing tables with multiple spanning cells.