Table detection has become a fundamental task in visually rich document understanding as the number of electronic documents surges. Several open datasets are widely used in existing studies. However, the popular datasets have inherent limitations: noisy and inconsistent samples, a limited number of training samples, and a limited number of data sources. These limitations make the datasets unreliable for evaluating model performance, so results on them may not reflect the actual capacity of models. Therefore, in this paper, we revisit several open datasets with high-quality annotations, identify and clean the noise, and align their annotation definitions so that they can be merged into a larger dataset, termed Open-Tables. Moreover, to enrich the data sources, we propose a new dataset, termed ICT-TD, built from PDF files of Information and Communication Technologies (ICT) commodities, a different domain containing unique samples that rarely appear in open datasets. To ensure label quality, we annotated the dataset manually under the guidance of a domain expert. The proposed dataset has a larger intra-variance and smaller inter-variance, which makes it more challenging and representative of actual cases in a business context. We built strong baselines using various state-of-the-art object detection models, and also built baselines in the cross-domain setting. Our experimental results show that the domain differences among existing open datasets are small, even though they come from different data sources. Our proposed Open-Tables and ICT-TD are more suitable for the cross-domain setting and can provide a more reliable evaluation of models because of their high-quality and consistent annotations.