Rethinking Detection Based Table Structure Recognition for Visually Rich Documents

Table Structure Recognition (TSR) aims to transform unstructured table images into structured formats, such as HTML sequences. One popular family of solutions uses detection models to detect table components, such as columns and rows, and then applies rule-based post-processing to convert the detection results into HTML sequences. However, existing detection-based studies often have the following limitations. First, they usually focus on improving detection performance, which does not necessarily lead to better cell-level metrics such as TEDS. Second, some solutions oversimplify the problem and can miss critical information. Lastly, even though some studies define the problem as detecting more components, so as to provide as much information as other types of solutions, they ignore the fact that this problem definition is a multi-label detection problem, because a row, a projected row header, and a column header can share identical bounding boxes. In addition, there is often a performance gap between two-stage and transformer-based detection models in terms of structure-only TEDS, even though they perform similarly on the COCO metrics. We therefore revisit the limitations of existing detection-based solutions, compare two-stage and transformer-based detection models, and identify the key design aspects behind a successful two-stage detection model for the TSR task: the multi-class problem definition, the aspect ratios used for anchor box generation, and the feature generation of the backbone network. Applying simple methods to improve these aspects of the Cascade R-CNN model, we achieve state-of-the-art performance and improve the baseline Cascade R-CNN model by 19.32%, 11.56%, and 14.77% in structure-only TEDS on the SciTSR, FinTabNet, and PubTables1M datasets, respectively.
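To make the detect-then-post-process pipeline concrete, the following is a minimal sketch of a rule-based conversion from detected row and column boxes to a structure-only HTML sequence. It is an illustration under simplifying assumptions, not the post-processing used in this work: the function name `boxes_to_html` and the `Box` alias are hypothetical, boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, and spanning cells, column headers, and projected row headers are deliberately ignored.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in image coordinates.
Box = Tuple[float, float, float, float]


def boxes_to_html(rows: List[Box], cols: List[Box]) -> Tuple[str, List[List[Box]]]:
    """Intersect row and column boxes into a regular grid of cells.

    Returns a structure-only HTML string and the grid of cell boxes.
    Assumption: no spanning cells or header components are handled.
    """
    rows = sorted(rows, key=lambda b: b[1])  # order rows top to bottom
    cols = sorted(cols, key=lambda b: b[0])  # order columns left to right

    # Each cell takes its horizontal extent from the column
    # and its vertical extent from the row.
    grid = [[(c[0], r[1], c[2], r[3]) for c in cols] for r in rows]

    # Emit one <tr> per row and one empty <td> per column.
    html_rows = ["<tr>" + "<td></td>" * len(cols) + "</tr>" for _ in rows]
    return "<table>" + "".join(html_rows) + "</table>", grid
```

A sketch like this also shows why detection quality and cell-level metrics can diverge: a small shift in one row box changes every cell in that row, while barely affecting box-level scores.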
