Contextual Text Block Detection (CTBD) is the task of identifying coherent text blocks in natural scenes. Previous methods have treated CTBD either as a visual relation extraction problem in computer vision or as a sequence modeling problem in natural language processing. We introduce a framework that formulates CTBD as a graph generation problem with two steps: detecting individual text units as graph nodes, and predicting the reading-order relationships among these units as graph edges. Our framework uses DQ-DETR for node detection and adds a new module, the Dynamic Relation Transformer (DRFormer), for edge generation. DRFormer incorporates a dual interactive transformer decoder that iteratively refines the graph structure. Through this refinement the model progressively improves the graph's fidelity, which in turn improves the precision of contextual text block detection. Experiments on the SCUT-CTW-Context and ReCTS-Context datasets show that our method achieves state-of-the-art results, demonstrating the effectiveness of the graph generation framework for CTBD.
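To make the graph formulation concrete, the following minimal sketch shows how contextual text blocks could be read off a graph once nodes and reading-order edges are available. It is not the authors' DRFormer and does not predict edges. It only illustrates a possible post-processing step, under the assumption that each text unit has at most one successor in reading order. The function name `group_text_blocks` and the `successor` mapping are hypothetical and introduced here for illustration.

```python
from typing import Dict, List, Optional


def group_text_blocks(
    num_units: int, successor: Dict[int, Optional[int]]
) -> List[List[int]]:
    """Group text units (nodes) into ordered blocks by following
    reading-order edges. successor[i] is the unit read right after
    unit i, or None (or missing) if unit i ends its block."""
    # Units that some other unit points to cannot start a block.
    has_predecessor = {nxt for nxt in successor.values() if nxt is not None}
    visited = set()
    blocks: List[List[int]] = []

    def follow(start: int) -> List[int]:
        # Walk the chain from `start` until it ends or hits a visited unit.
        chain = []
        current: Optional[int] = start
        while current is not None and current not in visited:
            visited.add(current)
            chain.append(current)
            current = successor.get(current)
        return chain

    # First pass: start a block at every unit without a predecessor.
    for unit in range(num_units):
        if unit not in has_predecessor:
            blocks.append(follow(unit))

    # Second pass: any unit still unvisited lies on a cycle, which a
    # noisy edge predictor could produce. Emitting each cycle as one
    # block is an assumption of this sketch.
    for unit in range(num_units):
        if unit not in visited:
            blocks.append(follow(unit))

    return blocks


# Six text units: 0 -> 1 -> 2 form one block, 3 -> 4 another, 5 is isolated.
edges = {0: 1, 1: 2, 2: None, 3: 4, 4: None}
print(group_text_blocks(6, edges))
```

Each returned list is one contextual text block, with units listed in reading order. A unit with no incoming or outgoing edge becomes a single-unit block.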