Revisiting Table Detection Datasets for Visually Rich Documents

Table detection has become a fundamental task in visually rich document understanding as the number of electronic documents surges. Several open datasets are widely used in existing studies. However, these popular datasets have inherent limitations: noisy and inconsistent samples, a limited number of training samples, and a limited number of data sources. These limitations make the datasets unreliable for evaluating model performance, so results on them may not reflect the actual capacity of models. Therefore, in this paper, we revisit several open datasets with high-quality annotations, identify and clean the noise, and align their annotation definitions so that they can be merged into a larger dataset, termed Open-Tables. Moreover, to enrich the data sources, we propose a new dataset, termed ICT-TD, built from PDF files of information and communication technologies (ICT) commodities, a different domain containing unique samples that hardly appear in open datasets. To ensure label quality, we annotated the dataset manually following the guidance of a domain expert. The proposed dataset has larger intra-variance and smaller inter-variance, which makes it more challenging and representative of actual cases in a business context. We built strong baselines using various state-of-the-art object detection models, and also built baselines in the cross-domain setting. Our experimental results show that the domain differences among existing open datasets are small, even though they come from different data sources. Our proposed Open-Tables and ICT-TD are more suitable for the cross-domain setting and can provide more reliable evaluation of models because of their high-quality and consistent annotations.

