Scene text detection models have progressed significantly since the rise of deep learning, but scene text layout analysis, which aims to group detected text instances into paragraphs, has not kept pace. Previous works either handle text detection and grouping with separate models or train a unified model from scratch. None of them makes full use of already well-trained text detectors and easily obtainable detection datasets. In this paper, we present Text Grouping Adapter (TGA), a module that enables various pre-trained text detectors to learn layout analysis, allowing us to adopt a well-trained text detector off the shelf or simply fine-tune it efficiently. Designed to be compatible with various text detector architectures, TGA takes detected text regions and image features as universal inputs to assemble text instance features. To capture broader contextual information for layout analysis, we propose to predict text group masks from text instance features by one-to-many assignment. Our comprehensive experiments demonstrate that, even with frozen pre-trained models, incorporating TGA into various pre-trained text detectors and text spotters achieves superior layout analysis performance while inheriting the generalized text detection ability from pre-training. With full-parameter fine-tuning, layout analysis performance improves further.
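The two ideas named above, assembling text instance features from detected regions and image features, and supervising each instance with the mask of its whole group, can be illustrated with a small NumPy sketch. This is not the TGA implementation: the abstract does not specify how features are assembled, so masked average pooling is an assumption here, and the function names `pool_instance_features` and `group_mask_targets` are hypothetical. The second function shows one plausible reading of one-to-many assignment, in which every instance in a paragraph shares the same target, the union of that paragraph's instance masks.

```python
import numpy as np


def pool_instance_features(feature_map, instance_masks):
    """Masked average pooling (an assumed stand-in for feature assembling).

    feature_map:    (C, H, W) image features from any detector backbone.
    instance_masks: (N, H, W) binary masks of detected text instances.
    Returns an (N, C) array with one feature vector per text instance.
    """
    masks = instance_masks.astype(feature_map.dtype)
    area = masks.sum(axis=(1, 2))                        # pixels per instance
    summed = np.einsum("nhw,chw->nc", masks, feature_map)
    # Guard against empty masks to avoid division by zero.
    return summed / np.maximum(area, 1.0)[:, None]


def group_mask_targets(instance_masks, group_ids):
    """Build group-mask targets under a one-to-many assignment.

    Every instance is assigned the union of the masks of all instances
    that share its group id, so one group mask supervises many instances.
    Returns an (N, H, W) boolean array.
    """
    group_ids = np.asarray(group_ids)
    same_group = (group_ids[:, None] == group_ids[None, :]).astype(np.int64)
    counts = np.einsum("ij,jhw->ihw", same_group, instance_masks.astype(np.int64))
    return counts > 0
```

Because both functions take only masks and a feature map, they do not depend on which detector produced them, which mirrors the "universal inputs" idea described in the abstract.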