Bridging the Performance Gap between DETR and R-CNN for Graphical Object Detection in Document Images

This paper takes an important step toward bridging the performance gap between DETR and R-CNN for graphical object detection. Existing graphical object detection approaches have benefited from recent advances in CNN-based object detection and have achieved remarkable progress. More recently, Transformer-based detectors have considerably improved generic object detection performance; by using object queries, they eliminate the need for hand-crafted features and for post-processing steps such as Non-Maximum Suppression (NMS). However, the effectiveness of these enhanced transformer-based detection algorithms has yet to be verified for graphical object detection. Inspired by the latest advancements in DETR, we employ the existing detection transformer with a few modifications for graphical object detection. We modify the object queries in different ways: using points, using anchor boxes, and adding positive and negative noise to the anchors to boost performance. These modifications allow for better handling of objects with varying sizes and aspect ratios, more robustness to small variations in object positions and sizes, and improved discrimination between objects and non-objects. We evaluate our approach on four graphical datasets: PubTables, TableBank, NTable and PubLayNet. After integrating the query modifications into DETR, we outperform prior works and achieve new state-of-the-art results, with mAP of 96.9%, 95.7% and 99.3% on TableBank, PubLayNet and PubTables, respectively. Results from extensive ablations show that, as in other applications, transformer-based methods are more effective for document analysis. We hope this study draws more attention to research on detection transformers in document image analysis.
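
As a rough illustration of the last query modification, the sketch below generates one positive and one negative noised anchor from a ground-truth box. It is not the paper's implementation: the normalized (cx, cy, w, h) box format, the function names, and the `noise_scale` threshold of 0.4 are all assumptions made for this example. Positive anchors receive corner shifts below the threshold and negative anchors receive larger ones, so a detector could be trained to recover the box from the former and to reject the latter.

```python
import random


def to_corners(box):
    """Convert a (cx, cy, w, h) box to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def to_center(corners):
    """Convert (x0, y0, x1, y1) corners back to (cx, cy, w, h)."""
    x0, y0, x1, y1 = corners
    return ((x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0)


def noised_anchor(box, low, high, rng):
    """Shift every corner by a random fraction of the box half-size.

    The fraction is drawn uniformly between `low` and `high` and applied
    with a random sign, so both position and size are perturbed.
    """
    _, _, w, h = box
    half_sizes = (w / 2, h / 2, w / 2, h / 2)
    noised = []
    for value, half in zip(to_corners(box), half_sizes):
        sign = rng.choice((-1.0, 1.0))
        shifted = value + sign * rng.uniform(low, high) * half
        # Keep the anchor inside the normalized image.
        noised.append(min(max(shifted, 0.0), 1.0))
    return to_center(noised)


def contrastive_pair(box, noise_scale=0.4, rng=None):
    """Return (positive, negative) noised anchors for one ground-truth box.

    Assumption for this sketch: shifts below `noise_scale` count as
    positive, shifts in [noise_scale, 2 * noise_scale) as negative.
    `noise_scale` must stay below 0.5 so opposite corners cannot cross.
    """
    if not 0.0 < noise_scale < 0.5:
        raise ValueError("noise_scale must be in (0, 0.5)")
    rng = rng or random.Random()
    positive = noised_anchor(box, 0.0, noise_scale, rng)
    negative = noised_anchor(box, noise_scale, 2 * noise_scale, rng)
    return positive, negative


if __name__ == "__main__":
    table_box = (0.5, 0.5, 0.2, 0.4)  # hypothetical ground-truth table
    pos, neg = contrastive_pair(table_box, rng=random.Random(0))
    print("positive anchor:", pos)
    print("negative anchor:", neg)
```

In a denoising-style training setup, the positive anchor would be paired with the ground-truth box as its regression target, while the negative anchor would be labeled as "no object".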
